Sub-millisecond [autovacuum_]vacuum_cost_delay broken

Started by Melanie Plagemanalmost 3 years ago26 messages

melanieplageman@gmail.com

almost 3 years ago

Hi,

I think that 4753ef37e0ed undid the work caf626b2c did to support
sub-millisecond delays for vacuum and autovacuum.

After 4753ef37e0ed, vacuum_delay_point()'s local variable msec is a
double which, after being passed to WaitLatch() as timeout, which is a
long, ends up being 0, so we don't end up waiting AFAICT.

When I set [autovacuum_]vacuum_delay_point to 0.5, SHOW will report that
it is 500us, but WaitLatch() is still getting 0 as timeout.

- Melanie

Thomas Munro

thomas.munro@gmail.com

almost 3 years ago

In reply to: Melanie Plageman (#1)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Fri, Mar 10, 2023 at 10:26 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I think that 4753ef37e0ed undid the work caf626b2c did to support
sub-millisecond delays for vacuum and autovacuum.

After 4753ef37e0ed, vacuum_delay_point()'s local variable msec is a
double which, after being passed to WaitLatch() as timeout, which is a
long, ends up being 0, so we don't end up waiting AFAICT.

When I set [autovacuum_]vacuum_delay_point to 0.5, SHOW will report that
it is 500us, but WaitLatch() is still getting 0 as timeout.

Given that some of the clunkier underlying kernel primitives have
milliseconds in their interface, I don't think it would be possible to
make a usec-based variant of WaitEventSetWait() that works everywhere.
Could it possibly make sense to do something that accumulates the
error, so if you're using 0.5 then every second vacuum_delay_point()
waits for 1ms?

Tom Lane

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Thomas Munro (#2)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Mar 10, 2023 at 10:26 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I think that 4753ef37e0ed undid the work caf626b2c did to support
sub-millisecond delays for vacuum and autovacuum.

Given that some of the clunkier underlying kernel primitives have
milliseconds in their interface, I don't think it would be possible to
make a usec-based variant of WaitEventSetWait() that works everywhere.
Could it possibly make sense to do something that accumulates the
error, so if you're using 0.5 then every second vacuum_delay_point()
waits for 1ms?

Yeah ... using float math there was cute, but it'd only get us so far.
The caf626b2c code would only work well on platforms that have
microsecond-based sleep primitives, so it was already not too portable.

Can we fix this by making VacuumCostBalance carry the extra fractional
delay, or would a separate variable be better?

regards, tom lane

Stephen Frost

sfrost@snowman.net

almost 3 years ago

In reply to: Thomas Munro (#2)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

Greetings,

* Thomas Munro (thomas.munro@gmail.com) wrote:

On Fri, Mar 10, 2023 at 10:26 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I think that 4753ef37e0ed undid the work caf626b2c did to support
sub-millisecond delays for vacuum and autovacuum.

After 4753ef37e0ed, vacuum_delay_point()'s local variable msec is a
double which, after being passed to WaitLatch() as timeout, which is a
long, ends up being 0, so we don't end up waiting AFAICT.

When I set [autovacuum_]vacuum_delay_point to 0.5, SHOW will report that
it is 500us, but WaitLatch() is still getting 0 as timeout.

Given that some of the clunkier underlying kernel primitives have
milliseconds in their interface, I don't think it would be possible to
make a usec-based variant of WaitEventSetWait() that works everywhere.
Could it possibly make sense to do something that accumulates the
error, so if you're using 0.5 then every second vacuum_delay_point()
waits for 1ms?

Hmm. That generally makes sense to me.. though isn't exactly the same.
Still, I wouldn't want to go back to purely pg_usleep() as that has the
other downsides mentioned.

Perhaps if the delay is sub-millisecond, explicitly do the WaitLatch()
with zero but also do the pg_usleep()? That's doing a fair bit of work
beyond just sleeping, but it also means we shouldn't miss out on the
postmaster going away or similar..

Thanks,

Stephen

Thomas Munro

thomas.munro@gmail.com

almost 3 years ago

In reply to: Tom Lane (#3)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Fri, Mar 10, 2023 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Mar 10, 2023 at 10:26 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I think that 4753ef37e0ed undid the work caf626b2c did to support
sub-millisecond delays for vacuum and autovacuum.

Given that some of the clunkier underlying kernel primitives have
milliseconds in their interface, I don't think it would be possible to
make a usec-based variant of WaitEventSetWait() that works everywhere.
Could it possibly make sense to do something that accumulates the
error, so if you're using 0.5 then every second vacuum_delay_point()
waits for 1ms?

Yeah ... using float math there was cute, but it'd only get us so far.
The caf626b2c code would only work well on platforms that have
microsecond-based sleep primitives, so it was already not too portable.

Also, the previous coding was already b0rked, because pg_usleep()
rounds up to milliseconds on Windows (with a surprising formula for
rounding), and also the whole concept seems to assume things about
schedulers that aren't really universally true. If we actually cared
about high res times maybe we should be using nanosleep and tracking
the drift? And spreading it out a bit. But I don't know.

Can we fix this by making VacuumCostBalance carry the extra fractional
delay, or would a separate variable be better?

I was wondering the same thing, but not being too familiar with that
code, no opinion on that yet.

Melanie Plageman

melanieplageman@gmail.com

almost 3 years ago

In reply to: Thomas Munro (#5)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Thu, Mar 9, 2023 at 5:10 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Mar 10, 2023 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Mar 10, 2023 at 10:26 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I think that 4753ef37e0ed undid the work caf626b2c did to support
sub-millisecond delays for vacuum and autovacuum.

Given that some of the clunkier underlying kernel primitives have
milliseconds in their interface, I don't think it would be possible to
make a usec-based variant of WaitEventSetWait() that works everywhere.
Could it possibly make sense to do something that accumulates the
error, so if you're using 0.5 then every second vacuum_delay_point()
waits for 1ms?

Yeah ... using float math there was cute, but it'd only get us so far.
The caf626b2c code would only work well on platforms that have
microsecond-based sleep primitives, so it was already not too portable.

Also, the previous coding was already b0rked, because pg_usleep()
rounds up to milliseconds on Windows (with a surprising formula for
rounding), and also the whole concept seems to assume things about
schedulers that aren't really universally true. If we actually cared
about high res times maybe we should be using nanosleep and tracking
the drift? And spreading it out a bit. But I don't know.

Can we fix this by making VacuumCostBalance carry the extra fractional
delay, or would a separate variable be better?

I was wondering the same thing, but not being too familiar with that
code, no opinion on that yet.

Well, that is reset to zero in vacuum() at the top -- which is called for
each table for autovacuum, so it would get reset to zero between
autovacuuming tables. I dunno how you feel about that...

- Melanie

Tom Lane

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Thomas Munro (#5)

1 attachment(s)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Mar 10, 2023 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

The caf626b2c code would only work well on platforms that have
microsecond-based sleep primitives, so it was already not too portable.

Also, the previous coding was already b0rked, because pg_usleep()
rounds up to milliseconds on Windows (with a surprising formula for
rounding), and also the whole concept seems to assume things about
schedulers that aren't really universally true. If we actually cared
about high res times maybe we should be using nanosleep and tracking
the drift? And spreading it out a bit. But I don't know.

Yeah, I was wondering about trying to make it a closed-loop control,
but I think that'd be huge overkill considering what the mechanism is
trying to accomplish.

A minimalistic fix could be as attached. I'm not sure if it's worth
making the state variable global so that it can be reset to zero in
the places where we zero out VacuumCostBalance etc. Also note that
this is ignoring the VacuumSharedCostBalance stuff, so you'd possibly
have the extra delay accumulating in unexpected places when there are
multiple workers. But I really doubt it's worth worrying about that.

Is it reasonable to assume that all modern platforms can time
millisecond delays accurately? Ten years ago I'd have suggested
truncating the delay to a multiple of 10msec and using this logic
to track the remainder, but maybe now that's unnecessary.

regards, tom lane

Attachments:

fix-fractional-vacuum-cost-delay-again-wip.patchtext/x-diff; charset=us-ascii; name=fix-fractional-vacuum-cost-delay-again-wip.patchDownload

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 2e12baf8eb..64d3c59709 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2229,14 +2229,30 @@ vacuum_delay_point(void)
 	/* Nap if appropriate */
 	if (msec > 0)
 	{
+		/*
+		 * Since WaitLatch can only wait in units of milliseconds, carry any
+		 * residual fractional msec in a static variable, and accumulate it to
+		 * add into the wait interval next time.
+		 */
+		static double residual_msec = 0;
+		long		lmsec;
+
+		msec += residual_msec;
+
 		if (msec > VacuumCostDelay * 4)
 			msec = VacuumCostDelay * 4;
 
-		(void) WaitLatch(MyLatch,
-						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-						 msec,
-						 WAIT_EVENT_VACUUM_DELAY);
-		ResetLatch(MyLatch);
+		lmsec = floor(msec);
+		residual_msec = msec - lmsec;
+
+		if (lmsec > 0)
+		{
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 lmsec,
+							 WAIT_EVENT_VACUUM_DELAY);
+			ResetLatch(MyLatch);
+		}
 
 		VacuumCostBalance = 0;

Nathan Bossart

nathandbossart@gmail.com

almost 3 years ago

In reply to: Tom Lane (#7)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Thu, Mar 09, 2023 at 05:27:08PM -0500, Tom Lane wrote:

Is it reasonable to assume that all modern platforms can time
millisecond delays accurately? Ten years ago I'd have suggested
truncating the delay to a multiple of 10msec and using this logic
to track the remainder, but maybe now that's unnecessary.

If so, it might also be worth updating or removing this comment in
pgsleep.c:

* NOTE: although the delay is specified in microseconds, the effective
* resolution is only 1/HZ, or 10 milliseconds, on most Unixen. Expect
* the requested delay to be rounded up to the next resolution boundary.

I've had doubts for some time about whether this is still accurate...

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Melanie Plageman

melanieplageman@gmail.com

almost 3 years ago

In reply to: Tom Lane (#7)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Thu, Mar 9, 2023 at 5:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Mar 10, 2023 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

The caf626b2c code would only work well on platforms that have
microsecond-based sleep primitives, so it was already not too portable.

Also, the previous coding was already b0rked, because pg_usleep()
rounds up to milliseconds on Windows (with a surprising formula for
rounding), and also the whole concept seems to assume things about
schedulers that aren't really universally true. If we actually cared
about high res times maybe we should be using nanosleep and tracking
the drift? And spreading it out a bit. But I don't know.

Yeah, I was wondering about trying to make it a closed-loop control,
but I think that'd be huge overkill considering what the mechanism is
trying to accomplish.

Not relevant to fixing this, but I wonder if you could eliminate the
need to specify the cost delay in most cases for autovacuum if you used
feedback from how much vacuuming work was done during the last cycle of
vacuuming to control the delay value internally - a kind of
feedback-adjusted controller.

- Melanie

#10

Melanie Plageman

melanieplageman@gmail.com

almost 3 years ago

In reply to: Tom Lane (#7)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Thu, Mar 9, 2023 at 5:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Mar 10, 2023 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

The caf626b2c code would only work well on platforms that have
microsecond-based sleep primitives, so it was already not too portable.

Also, the previous coding was already b0rked, because pg_usleep()
rounds up to milliseconds on Windows (with a surprising formula for
rounding), and also the whole concept seems to assume things about
schedulers that aren't really universally true. If we actually cared
about high res times maybe we should be using nanosleep and tracking
the drift? And spreading it out a bit. But I don't know.

Yeah, I was wondering about trying to make it a closed-loop control,
but I think that'd be huge overkill considering what the mechanism is
trying to accomplish.

A minimalistic fix could be as attached. I'm not sure if it's worth
making the state variable global so that it can be reset to zero in
the places where we zero out VacuumCostBalance etc. Also note that
this is ignoring the VacuumSharedCostBalance stuff, so you'd possibly
have the extra delay accumulating in unexpected places when there are
multiple workers. But I really doubt it's worth worrying about that.

What if someone resets the delay guc and there is still a large residual?

#11

Tom Lane

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Melanie Plageman (#10)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

Melanie Plageman <melanieplageman@gmail.com> writes:

On Thu, Mar 9, 2023 at 5:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

A minimalistic fix could be as attached. I'm not sure if it's worth
making the state variable global so that it can be reset to zero in
the places where we zero out VacuumCostBalance etc. Also note that
this is ignoring the VacuumSharedCostBalance stuff, so you'd possibly
have the extra delay accumulating in unexpected places when there are
multiple workers. But I really doubt it's worth worrying about that.

What if someone resets the delay guc and there is still a large residual?

By definition, the residual is less than 1msec.

regards, tom lane

#12

Thomas Munro

thomas.munro@gmail.com

almost 3 years ago

In reply to: Nathan Bossart (#8)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Fri, Mar 10, 2023 at 11:37 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

On Thu, Mar 09, 2023 at 05:27:08PM -0500, Tom Lane wrote:

Is it reasonable to assume that all modern platforms can time
millisecond delays accurately? Ten years ago I'd have suggested
truncating the delay to a multiple of 10msec and using this logic
to track the remainder, but maybe now that's unnecessary.

If so, it might also be worth updating or removing this comment in
pgsleep.c:

* NOTE: although the delay is specified in microseconds, the effective
* resolution is only 1/HZ, or 10 milliseconds, on most Unixen. Expect
* the requested delay to be rounded up to the next resolution boundary.

I've had doubts for some time about whether this is still accurate...

What I see with the old select(), or a more modern clock_nanosleep()
call, is that Linux, FreeBSD, macOS are happy sleeping for .1ms, .5ms,
1ms, 2ms, 3ms, and through innaccuracies and scheduling overheads etc
it works out to about 5-25% extra sleep time (I expect that can be
affected by choice of time source/available hardware, and perhaps
various system calls use different tricks). I definitely recall the
behaviour described, back in the old days where more stuff was
scheduler-tick based. I have no clue for Windows; quick googling
tells me that it might still be pretty chunky, unless you do certain
other stuff that I didn't follow up; we could probably get more
accurate sleep times by rummaging through nt.dll. It would be good to
find out how well WaitEventSet does on Windows; perhaps we should have
a little timing accuracy test in the tree to collect build farm data?

FWIW epoll has a newer _pwait2() call that has higher res timeout
argument, and Windows WaitEventSet could also do high res timers if
you add timer events rather than using the timeout argument, and I
guess conceptually even the old poll() thing could do the equivalent
with a signal alarm timer, but it sounds a lot like a bad idea to do
very short sleeps to me, burning so much CPU on scheduling. I kinda
wonder if the 10ms + residual thing might even turn out to be a better
idea... but I dunno.

The 1ms residual thing looks pretty good to me as a fix to the
immediate problem report, but we might also want to adjust the wording
in config.sgml?

#13

Thomas Munro

thomas.munro@gmail.com

almost 3 years ago

In reply to: Thomas Munro (#12)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

Erm, but maybe I'm just looking at this too myopically. Is there
really any point in letting people set it to 0.5, if it behaves as if
you'd set it to 1 and doubled the cost limit? Isn't it just more
confusing? I haven't read the discussion from when fractional delays
came in, where I imagine that must have come up...

#14

Tom Lane

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Thomas Munro (#13)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

Thomas Munro <thomas.munro@gmail.com> writes:

Erm, but maybe I'm just looking at this too myopically. Is there
really any point in letting people set it to 0.5, if it behaves as if
you'd set it to 1 and doubled the cost limit? Isn't it just more
confusing? I haven't read the discussion from when fractional delays
came in, where I imagine that must have come up...

At [1]/messages/by-id/28720.1552101086@sss.pgh.pa.us I argued

The reason is this: what we want to do is throttle VACUUM's I/O demand,
and by "throttle" I mean "gradually reduce". There is nothing gradual
about issuing a few million I/Os and then sleeping for many milliseconds;
that'll just produce spikes and valleys in the I/O demand. Ideally,
what we'd have it do is sleep for a very short interval after each I/O.
But that's not too practical, both for code-structure reasons and because
most platforms don't give us a way to so finely control the length of a
sleep. Hence the design of sleeping for awhile after every so many I/Os.

However, the current settings are predicated on the assumption that
you can't get the kernel to give you a sleep of less than circa 10ms.
That assumption is way outdated, I believe; poking around on systems
I have here, the minimum delay time using pg_usleep(1) seems to be
generally less than 100us, and frequently less than 10us, on anything
released in the last decade.

I propose therefore that instead of increasing vacuum_cost_limit,
what we ought to be doing is reducing vacuum_cost_delay by a similar
factor. And, to provide some daylight for people to reduce it even
more, we ought to arrange for it to be specifiable in microseconds
not milliseconds. There's no GUC_UNIT_US right now, but it's time.

That last point was later overruled in favor of keeping it measured in
msec to avoid breaking existing configuration files. Nonetheless,
vacuum_cost_delay *is* an actual time to wait (conceptually at least),
not just part of a unitless ratio; and there seem to be good arguments
in favor of letting people make it small.

I take your point that really short sleeps are inefficient so far as the
scheduling overhead goes. But on modern machines you probably have to get
down to a not-very-large number of microseconds before that's a big deal.

regards, tom lane

[1]: /messages/by-id/28720.1552101086@sss.pgh.pa.us

#15

Thomas Munro

thomas.munro@gmail.com

almost 3 years ago

In reply to: Tom Lane (#14)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Fri, Mar 10, 2023 at 1:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I propose therefore that instead of increasing vacuum_cost_limit,
what we ought to be doing is reducing vacuum_cost_delay by a similar
factor. And, to provide some daylight for people to reduce it even
more, we ought to arrange for it to be specifiable in microseconds
not milliseconds. There's no GUC_UNIT_US right now, but it's time.

That last point was later overruled in favor of keeping it measured in
msec to avoid breaking existing configuration files. Nonetheless,
vacuum_cost_delay *is* an actual time to wait (conceptually at least),
not just part of a unitless ratio; and there seem to be good arguments
in favor of letting people make it small.

I take your point that really short sleeps are inefficient so far as the
scheduling overhead goes. But on modern machines you probably have to get
down to a not-very-large number of microseconds before that's a big deal.

OK. One idea is to provide a WaitLatchUsec(), which is just some
cross platform donkeywork that I think I know how to type in, and it
would have to round up on poll() and Windows builds. Then we could
either also provide WaitEventSetResolution() that returns 1000 or 1
depending on availability of 1us waits so that we could round
appropriately and then track residual, but beyond that let the user
worry about inaccuracies and overheads (as mentioned in the
documentation), or we could start consulting the clock and tracking
our actual sleep time and true residual over time (maybe that's what
"closed-loop control" means?).

#16

Tom Lane

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Thomas Munro (#15)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

Thomas Munro <thomas.munro@gmail.com> writes:

OK. One idea is to provide a WaitLatchUsec(), which is just some
cross platform donkeywork that I think I know how to type in, and it
would have to round up on poll() and Windows builds. Then we could
either also provide WaitEventSetResolution() that returns 1000 or 1
depending on availability of 1us waits so that we could round
appropriately and then track residual, but beyond that let the user
worry about inaccuracies and overheads (as mentioned in the
documentation),

... so we'd still need to have the residual-sleep-time logic?

or we could start consulting the clock and tracking
our actual sleep time and true residual over time (maybe that's what
"closed-loop control" means?).

Yeah, I was hand-waving about trying to measure our actual sleep times.
On reflection I doubt it's a great idea. It'll add overhead and there's
still a question of whether measurement noise would accumulate.

regards, tom lane

#17

Thomas Munro

thomas.munro@gmail.com

almost 3 years ago

In reply to: Tom Lane (#16)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Fri, Mar 10, 2023 at 2:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

OK. One idea is to provide a WaitLatchUsec(), which is just some
cross platform donkeywork that I think I know how to type in, and it
would have to round up on poll() and Windows builds. Then we could
either also provide WaitEventSetResolution() that returns 1000 or 1
depending on availability of 1us waits so that we could round
appropriately and then track residual, but beyond that let the user
worry about inaccuracies and overheads (as mentioned in the
documentation),

... so we'd still need to have the residual-sleep-time logic?

Ah, perhaps not. Considering that the historical behaviour on the
main affected platform (Windows) was already to round up to
milliseconds before we latchified this code anyway, and now a google
search is telling me that the relevant timer might in fact be *super*
lumpy, perhaps even to the point of 1/64th of a second[1]https://randomascii.wordpress.com/2020/10/04/windows-timer-resolution-the-great-rule-change/ (maybe
that's a problem for a Windows hacker to look into some time; I really
should create a wiki page of known Windows problems in search of a
hacker)... it now looks like sub-ms residual logic would be a bit
pointless after all.

I'll go and see about usec latch waits. More soon.

[1]: https://randomascii.wordpress.com/2020/10/04/windows-timer-resolution-the-great-rule-change/

#18

Thomas Munro

thomas.munro@gmail.com

almost 3 years ago

In reply to: Thomas Munro (#17)

4 attachment(s)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Fri, Mar 10, 2023 at 2:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I'll go and see about usec latch waits. More soon.

Here are some experimental patches along those lines. Seems good
locally, but I saw a random failure I don't understand on CI so
apparently I need to find a bug; at least this gives an idea of how
this might look. Unfortunately, the new interface on Linux turned out
to be newer that I first realised: Linux 5.11+ (so RHEL 9, Debian
12/Bookworm, Ubuntu 21.04/Hirsute Hippo), so unless we're OK with it
taking a couple more years to be more widely used, we'll need some
fallback code. Perhaps something like 0004, which also shows the sort
of thing that we might consider back-patching to 14 and 15 (next
revision I'll move that up the front and put it in back-patchable
form). It's not exactly beautiful; maybe sharing code with recovery's
lazy PM-exit detection could help. Of course, the new μs-based wait
API could be used wherever we do timestamp-based waiting, for no
particular reason other than that it is the resolution of our
timestamps, so there is no need to bother rounding; I doubt anyone
would notice or care much about that, but it's vote in favour of μs
rather than the other obvious contender ns, which modern underlying
kernel primitives are using.

Attachments:

0001-Support-microsecond-based-timeouts-in-WaitEventSet-A.patchtext/x-patch; charset=US-ASCII; name=0001-Support-microsecond-based-timeouts-in-WaitEventSet-A.patchDownload

From 5a6b8e348ba18ef0f7ff3e5098b54a690a87c7b0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 10 Mar 2023 15:16:47 +1300
Subject: [PATCH 1/4] Support microsecond based timeouts in WaitEventSet API.

Melanie Plageman discovered that commit 4753ef37e0ed undid the work
caf626b2c did to support sub-millisecond delays for vacuum and
autovacuum.  In order to restore support for vacuum_cost_delay of
fractions of a millisecond, first we'll need higher resolution timeouts
in the WaitEventSet API.

We can do that with modern epoll (since Linux 5.11), and with kqueue,
since both of those work in nanoseconds.  For Windows, we can't, but the
timer resolution was already very low and the documentation for
vacuum_cost_delay warns about that.  For older epoll builds, and
commercial Unixen where we still have to use poll(), we'll internally
round up to whole milliseconds.

Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_b-q0hXCBUCAATh0Z4Zi6UkiC0k2DFgoD3nC-r3SkR3tg%40mail.gmail.com
---
 configure                       |   2 +-
 configure.ac                    |   1 +
 meson.build                     |   1 +
 src/backend/storage/ipc/latch.c | 145 ++++++++++++++++++++++++--------
 src/include/pg_config.h.in      |   3 +
 src/include/storage/latch.h     |  13 ++-
 src/tools/msvc/Solution.pm      |   1 +
 7 files changed, 127 insertions(+), 39 deletions(-)

diff --git a/configure b/configure
index e35769ea73..914361f91b 100755
--- a/configure
+++ b/configure
@@ -15699,7 +15699,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols copyfile getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l
+for ac_func in backtrace_symbols copyfile epoll_pwait2 getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.ac b/configure.ac
index af23c15cb2..4249f8002c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1794,6 +1794,7 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 AC_CHECK_FUNCS(m4_normalize([
 	backtrace_symbols
 	copyfile
+	epoll_pwait2
 	getifaddrs
 	getpeerucred
 	inet_pton
diff --git a/meson.build b/meson.build
index 2409cc2254..29ca956f5e 100644
--- a/meson.build
+++ b/meson.build
@@ -2344,6 +2344,7 @@ func_checks = [
   # when enabling asan the dlopen check doesn't notice that -ldl is actually
   # required. Just checking for dlsym() ought to suffice.
   ['dlsym', {'dependencies': [dl_dep], 'define': false}],
+  ['epoll_pwait2'],
   ['explicit_bzero'],
   ['fdatasync', {'dependencies': [rt_dep, posix4_dep], 'define': false}], # Solaris
   ['getifaddrs'],
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f4123e7de7..5c138ab9bf 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -194,7 +194,7 @@ static void WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event);
 static void WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event);
 #endif
 
-static inline int WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+static inline int WaitEventSetWaitBlock(WaitEventSet *set, int64 cur_timeout_us,
 										WaitEvent *occurred_events, int nevents);
 
 /*
@@ -475,10 +475,9 @@ DisownLatch(Latch *latch)
  * to wait for. If the latch is already set (and WL_LATCH_SET is given), the
  * function returns immediately.
  *
- * The "timeout" is given in milliseconds. It must be >= 0 if WL_TIMEOUT flag
- * is given.  Although it is declared as "long", we don't actually support
- * timeouts longer than INT_MAX milliseconds.  Note that some extra overhead
- * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible.
+ * The "timeout" is given in microseconds.  It must be >= 0 if WL_TIMEOUT flag
+ * is given.  Note that some extra overhead is incurred when WL_TIMEOUT is
+ * given, so avoid using a timeout if possible.
  *
  * The latch must be owned by the current process, ie. it must be a
  * process-local latch initialized with InitLatch, or a shared latch
@@ -489,8 +488,8 @@ DisownLatch(Latch *latch)
  * we return all of them in one call, but we will return at least one.
  */
 int
-WaitLatch(Latch *latch, int wakeEvents, long timeout,
-		  uint32 wait_event_info)
+WaitLatchUs(Latch *latch, int wakeEvents, int64 timeout_us,
+			uint32 wait_event_info)
 {
 	WaitEvent	event;
 
@@ -510,15 +509,32 @@ WaitLatch(Latch *latch, int wakeEvents, long timeout,
 	LatchWaitSet->exit_on_postmaster_death =
 		((wakeEvents & WL_EXIT_ON_PM_DEATH) != 0);
 
-	if (WaitEventSetWait(LatchWaitSet,
-						 (wakeEvents & WL_TIMEOUT) ? timeout : -1,
-						 &event, 1,
-						 wait_event_info) == 0)
+	if (WaitEventSetWaitUs(LatchWaitSet,
+						   (wakeEvents & WL_TIMEOUT) ? timeout_us : -1,
+						   &event, 1,
+						   wait_event_info) == 0)
 		return WL_TIMEOUT;
 	else
 		return event.events;
 }
 
+/*
+ * Like WaitLatchUs(), but with the timeout in milliseconds.
+ *
+ * The "timeout" is given in milliseconds. It must be >= 0 if WL_TIMEOUT flag
+ * is given.  Although it is declared as "long", we don't actually support
+ * timeouts longer than INT_MAX milliseconds.  Note that some extra overhead
+ * is incurred when WL_TIMEOUT is given, so avoid using a timeout if possible.
+ */
+int
+WaitLatch(Latch *latch, int wakeEvents, long timeout_ms,
+		  uint32 wait_event_info)
+{
+	return WaitLatchUs(latch, wakeEvents,
+					   timeout_ms <= 0 ? timeout_ms : timeout_ms * 1000,
+					   wait_event_info);
+}
+
 /*
  * Like WaitLatch, but with an extra socket argument for WL_SOCKET_*
  * conditions.
@@ -537,8 +553,8 @@ WaitLatch(Latch *latch, int wakeEvents, long timeout,
  * WaitEventSet instead; that's more efficient.
  */
 int
-WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
-				  long timeout, uint32 wait_event_info)
+WaitLatchOrSocketUs(Latch *latch, int wakeEvents, pgsocket sock,
+					int64 timeout_us, uint32 wait_event_info)
 {
 	int			ret = 0;
 	int			rc;
@@ -546,9 +562,9 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
-		Assert(timeout >= 0);
+		Assert(timeout_us >= 0);
 	else
-		timeout = -1;
+		timeout_us = -1;
 
 	if (wakeEvents & WL_LATCH_SET)
 		AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET,
@@ -575,7 +591,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 		AddWaitEventToSet(set, ev, sock, NULL, NULL);
 	}
 
-	rc = WaitEventSetWait(set, timeout, &event, 1, wait_event_info);
+	rc = WaitEventSetWaitUs(set, timeout_us, &event, 1, wait_event_info);
 
 	if (rc == 0)
 		ret |= WL_TIMEOUT;
@@ -591,6 +607,20 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 	return ret;
 }
 
+/*
+ * Like WaitLatchOrSocket, but with timeout in milliseconds.
+ */
+int
+WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
+				  long timeout_ms, uint32 wait_event_info)
+{
+	return WaitLatchOrSocketUs(latch,
+							   wakeEvents,
+							   sock,
+							   timeout_ms > 0 ? timeout_ms * 1000 : timeout_ms,
+							   wait_event_info);
+}
+
 /*
  * Sets a latch and wakes up anyone waiting on it.
  *
@@ -1380,14 +1410,14 @@ WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
  * values associated with the registered event.
  */
 int
-WaitEventSetWait(WaitEventSet *set, long timeout,
-				 WaitEvent *occurred_events, int nevents,
-				 uint32 wait_event_info)
+WaitEventSetWaitUs(WaitEventSet *set, int64 timeout_us,
+				   WaitEvent *occurred_events, int nevents,
+				   uint32 wait_event_info)
 {
 	int			returned_events = 0;
 	instr_time	start_time;
 	instr_time	cur_time;
-	long		cur_timeout = -1;
+	int64		cur_timeout = -1;
 
 	Assert(nevents > 0);
 
@@ -1395,11 +1425,11 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 	 * Initialize timeout if requested.  We must record the current time so
 	 * that we can determine the remaining timeout if interrupted.
 	 */
-	if (timeout >= 0)
+	if (timeout_us >= 0)
 	{
 		INSTR_TIME_SET_CURRENT(start_time);
-		Assert(timeout >= 0 && timeout <= INT_MAX);
-		cur_timeout = timeout;
+		Assert(timeout_us >= 0);
+		cur_timeout = timeout_us;
 	}
 	else
 		INSTR_TIME_SET_ZERO(start_time);
@@ -1487,11 +1517,11 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 			returned_events = rc;
 
 		/* If we're not done, update cur_timeout for next iteration */
-		if (returned_events == 0 && timeout >= 0)
+		if (returned_events == 0 && timeout_us >= 0)
 		{
 			INSTR_TIME_SET_CURRENT(cur_time);
 			INSTR_TIME_SUBTRACT(cur_time, start_time);
-			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			cur_timeout = timeout_us - INSTR_TIME_GET_MICROSEC(cur_time);
 			if (cur_timeout <= 0)
 				break;
 		}
@@ -1505,6 +1535,20 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 	return returned_events;
 }
 
+/*
+ * Like WaitEventSetWaitUs(), but the timeout specified in milliseconds.
+ */
+int
+WaitEventSetWait(WaitEventSet *set, long timeout_ms,
+				 WaitEvent *occurred_events, int nevents,
+				 uint32 wait_event_info)
+{
+	return WaitEventSetWaitUs(set,
+							  timeout_ms <= 0 ? timeout_ms : timeout_ms * 1000,
+							  occurred_events,
+							  nevents,
+							  wait_event_info);
+}
 
 #if defined(WAIT_USE_EPOLL)
 
@@ -1517,17 +1561,30 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
  * easy.
  */
 static inline int
-WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+WaitEventSetWaitBlock(WaitEventSet *set, int64 cur_timeout_us,
 					  WaitEvent *occurred_events, int nevents)
 {
 	int			returned_events = 0;
 	int			rc;
 	WaitEvent  *cur_event;
 	struct epoll_event *cur_epoll_event;
+#ifdef HAVE_EPOLL_PWAIT2
+	struct timespec nap;
+#endif
 
 	/* Sleep */
+#ifdef HAVE_EPOLL_PWAIT2
+	nap.tv_sec = cur_timeout_us / 1000000;
+	nap.tv_nsec = (cur_timeout_us % 1000000) * 1000;
+	rc = epoll_pwait2(set->epoll_fd, set->epoll_ret_events,
+					  Min(nevents, set->nevents_space),
+					  cur_timeout_us >= 0 ? &nap : NULL,
+					  NULL);
+#else
 	rc = epoll_wait(set->epoll_fd, set->epoll_ret_events,
-					Min(nevents, set->nevents_space), cur_timeout);
+					Min(nevents, set->nevents_space),
+					(cur_timeout_us + 999) / 1000);
+#endif
 
 	/* Check return code */
 	if (rc < 0)
@@ -1653,7 +1710,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
  * with separate system calls.
  */
 static int
-WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+WaitEventSetWaitBlock(WaitEventSet *set, int64 cur_timeout_us,
 					  WaitEvent *occurred_events, int nevents)
 {
 	int			returned_events = 0;
@@ -1663,12 +1720,12 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 	struct timespec timeout;
 	struct timespec *timeout_p;
 
-	if (cur_timeout < 0)
+	if (cur_timeout_us < 0)
 		timeout_p = NULL;
 	else
 	{
-		timeout.tv_sec = cur_timeout / 1000;
-		timeout.tv_nsec = (cur_timeout % 1000) * 1000000;
+		timeout.tv_sec = cur_timeout_us / 1000000;
+		timeout.tv_nsec = (cur_timeout_us % 1000000) * 1000;
 		timeout_p = &timeout;
 	}
 
@@ -1806,16 +1863,25 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
  * but requires iterating through all of set->pollfds.
  */
 static inline int
-WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+WaitEventSetWaitBlock(WaitEventSet *set, int64 cur_timeout_us,
 					  WaitEvent *occurred_events, int nevents)
 {
 	int			returned_events = 0;
 	int			rc;
 	WaitEvent  *cur_event;
 	struct pollfd *cur_pollfd;
+	int			cur_timeout_ms;
+
+	/* Round up to the nearest millisecond, and cap at INT_MAX. */
+	if (cur_timeout_us >= PG_INT64_MAX - 999)
+		cur_timeout_ms = INT_MAX;
+	else if (cur_timeout_us > 0)
+		cur_timeout_ms = Min((int64) INT_MAX, (cur_timeout_us + 999) / 1000);
+	else
+		cur_timeout_ms = cur_timeout_us;
 
 	/* Sleep */
-	rc = poll(set->pollfds, set->nevents, (int) cur_timeout);
+	rc = poll(set->pollfds, set->nevents, cur_timeout_ms);
 
 	/* Check return code */
 	if (rc < 0)
@@ -1943,12 +2009,21 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
  * that only one event is "consumed".
  */
 static inline int
-WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+WaitEventSetWaitBlock(WaitEventSet *set, int64 cur_timeout_us,
 					  WaitEvent *occurred_events, int nevents)
 {
 	int			returned_events = 0;
 	DWORD		rc;
 	WaitEvent  *cur_event;
+	int			cur_timeout_ms;
+
+	/* Round up to the nearest millisecond, and cap at INT_MAX. */
+	if (cur_timeout_us >= PG_INT64_MAX - 999)
+		cur_timeout_ms = INT_MAX;
+	else if (cur_timeout_us > 0)
+		cur_timeout_ms = Min((int64) INT_MAX, (cur_timeout_us + 999) / 1000);
+	else
+		cur_timeout_ms = cur_timeout_us;
 
 	/* Reset any wait events that need it */
 	for (cur_event = set->events;
@@ -2000,7 +2075,7 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 	 * Need to wait for ->nevents + 1, because signal handle is in [0].
 	 */
 	rc = WaitForMultipleObjects(set->nevents + 1, set->handles, FALSE,
-								cur_timeout);
+								cur_timeout_ms);
 
 	/* Check return code */
 	if (rc == WAIT_FAILED)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 20c82f5979..c1f1fc6e70 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -149,6 +149,9 @@
 /* Define to 1 if you have the <editline/readline.h> header file. */
 #undef HAVE_EDITLINE_READLINE_H
 
+/* Define to 1 if you have the `epoll_pwait2' function. */
+#undef HAVE_EPOLL_PWAIT2
+
 /* Define to 1 if you have the <execinfo.h> header file. */
 #undef HAVE_EXECINFO_H
 
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 99cc47874a..756c3114ed 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -180,13 +180,20 @@ extern int	AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 							  Latch *latch, void *user_data);
 extern void ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch);
 
-extern int	WaitEventSetWait(WaitEventSet *set, long timeout,
+extern int	WaitEventSetWait(WaitEventSet *set, long timeout_ms,
 							 WaitEvent *occurred_events, int nevents,
 							 uint32 wait_event_info);
-extern int	WaitLatch(Latch *latch, int wakeEvents, long timeout,
+extern int	WaitEventSetWaitUs(WaitEventSet *set, int64 timeout_us,
+							   WaitEvent *occurred_events, int nevents,
+							   uint32 wait_event_info);
+extern int	WaitLatch(Latch *latch, int wakeEvents, long timeout_ms,
 					  uint32 wait_event_info);
+extern int	WaitLatchUs(Latch *latch, int wakeEvents, int64 timeout_us,
+						uint32 wait_event_info);
 extern int	WaitLatchOrSocket(Latch *latch, int wakeEvents,
-							  pgsocket sock, long timeout, uint32 wait_event_info);
+							  pgsocket sock, long timeout_ms, uint32 wait_event_info);
+extern int	WaitLatchOrSocketUs(Latch *latch, int wakeEvents,
+								pgsocket sock, int64 timeout_us, uint32 wait_event_info);
 extern void InitializeLatchWaitSet(void);
 extern int	GetNumRegisteredWaitEvents(WaitEventSet *set);
 extern bool WaitEventSetCanReportClosed(void);
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 5eaea6355e..f88fffa5e2 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -247,6 +247,7 @@ sub GenerateFiles
 		HAVE_DECL_STRNLEN                           => 1,
 		HAVE_EDITLINE_HISTORY_H                     => undef,
 		HAVE_EDITLINE_READLINE_H                    => undef,
+		HAVE_EPOLL_PWAIT2                           => undef,
 		HAVE_EXECINFO_H                             => undef,
 		HAVE_EXPLICIT_BZERO                         => undef,
 		HAVE_FSEEKO                                 => 1,
-- 
2.39.2

0002-Use-microsecond-based-naps-for-vacuum_cost_delay-sle.patchtext/x-patch; charset=US-ASCII; name=0002-Use-microsecond-based-naps-for-vacuum_cost_delay-sle.patchDownload

From 940ea7054eb9fb350acbd58647dc52a4e49629b9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 10 Mar 2023 16:22:59 +1300
Subject: [PATCH 2/4] Use microsecond-based naps for vacuum_cost_delay sleep.

Now that we have microsecond support in the WaitEventSet API, we can fix
the reported problem with the accuracy of vacuum_cost_delay on some
systems.

Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_b-q0hXCBUCAATh0Z4Zi6UkiC0k2DFgoD3nC-r3SkR3tg%40mail.gmail.com
---
 src/backend/commands/vacuum.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 2e12baf8eb..f379e60dca 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2232,10 +2232,10 @@ vacuum_delay_point(void)
 		if (msec > VacuumCostDelay * 4)
 			msec = VacuumCostDelay * 4;
 
-		(void) WaitLatch(MyLatch,
-						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-						 msec,
-						 WAIT_EVENT_VACUUM_DELAY);
+		(void) WaitLatchUs(MyLatch,
+						   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+						   msec * 1000,
+						   WAIT_EVENT_VACUUM_DELAY);
 		ResetLatch(MyLatch);
 
 		VacuumCostBalance = 0;
-- 
2.39.2

0003-Use-microsecond-based-naps-in-walreceiver.patchtext/x-patch; charset=US-ASCII; name=0003-Use-microsecond-based-naps-in-walreceiver.patchDownload

From 6744818fe556f07fe6c004320fa18a15fbee98fd Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 10 Mar 2023 16:19:42 +1300
Subject: [PATCH 3/4] Use microsecond-based naps in walreceiver.

Since anything based on timestamp differences is really in microseconds
under the covers, we might as well use the new higher resolution API for
waiting.
---
 src/backend/replication/walreceiver.c | 16 ++++++++--------
 src/backend/utils/adt/timestamp.c     | 20 ++++++++++++++++++++
 src/include/utils/timestamp.h         |  2 ++
 3 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index f6446da2d6..18c66c0c63 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -445,7 +445,7 @@ WalReceiverMain(void)
 				pgsocket	wait_fd = PGINVALID_SOCKET;
 				int			rc;
 				TimestampTz nextWakeup;
-				long		nap;
+				int64		nap;
 
 				/*
 				 * Exit walreceiver if we're not in recovery. This should not
@@ -530,7 +530,7 @@ WalReceiverMain(void)
 
 				/* Calculate the nap time, clamping as necessary. */
 				now = GetCurrentTimestamp();
-				nap = TimestampDifferenceMilliseconds(now, nextWakeup);
+				nap = TimestampDifferenceMicroseconds(now, nextWakeup);
 
 				/*
 				 * Ideally we would reuse a WaitEventSet object repeatedly
@@ -544,12 +544,12 @@ WalReceiverMain(void)
 				 * avoiding some system calls.
 				 */
 				Assert(wait_fd != PGINVALID_SOCKET);
-				rc = WaitLatchOrSocket(MyLatch,
-									   WL_EXIT_ON_PM_DEATH | WL_SOCKET_READABLE |
-									   WL_TIMEOUT | WL_LATCH_SET,
-									   wait_fd,
-									   nap,
-									   WAIT_EVENT_WAL_RECEIVER_MAIN);
+				rc = WaitLatchOrSocketUs(MyLatch,
+										 WL_EXIT_ON_PM_DEATH | WL_SOCKET_READABLE |
+										 WL_TIMEOUT | WL_LATCH_SET,
+										 wait_fd,
+										 nap,
+										 WAIT_EVENT_WAL_RECEIVER_MAIN);
 				if (rc & WL_LATCH_SET)
 				{
 					ResetLatch(MyLatch);
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index de93db89d4..52f6568397 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1719,6 +1719,26 @@ TimestampDifferenceMilliseconds(TimestampTz start_time, TimestampTz stop_time)
 		return (long) ((diff + 999) / 1000);
 }
 
+/*
+ * TimestampDifferenceMicroseconds -- convert the difference between two
+ * 		timestamps into microseconds
+ *
+ * Compute a wait time for WaitLatchUs().
+ */
+int64
+TimestampDifferenceMicroseconds(TimestampTz start_time, TimestampTz stop_time)
+{
+	TimestampTz diff;
+
+	/* Deal with zero or negative elapsed time quickly. */
+	if (start_time >= stop_time)
+		return 0;
+	/* To not fail with timestamp infinities, we must detect overflow. */
+	if (pg_sub_s64_overflow(stop_time, start_time, &diff))
+		return PG_INT64_MAX;
+	return diff;
+}
+
 /*
  * TimestampDifferenceExceeds -- report whether the difference between two
  *		timestamps is >= a threshold (expressed in milliseconds)
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index edd59dc432..1caa15221d 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -100,6 +100,8 @@ extern void TimestampDifference(TimestampTz start_time, TimestampTz stop_time,
 								long *secs, int *microsecs);
 extern long TimestampDifferenceMilliseconds(TimestampTz start_time,
 											TimestampTz stop_time);
+extern int64 TimestampDifferenceMicroseconds(TimestampTz start_time,
+											 TimestampTz stop_time);
 extern bool TimestampDifferenceExceeds(TimestampTz start_time,
 									   TimestampTz stop_time,
 									   int msec);
-- 
2.39.2

0004-Provide-fallback-implementation-of-vacuum_cost_delay.patchtext/x-patch; charset=US-ASCII; name=0004-Provide-fallback-implementation-of-vacuum_cost_delay.patchDownload

From db7a1ad2a7857e5b7a3aa679a0769fdc787ba4a1 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 10 Mar 2023 17:05:03 +1300
Subject: [PATCH 4/4] Provide fallback implementation of vacuum_cost_delay.

Since some systems don't have support for microsends in the underlying
kernel primitives used by the WaitEventSet API, but we also don't want
to give up postmaster death detection that commit 4753ef37e0ed added, we
need another way forward.  On such systems, re-introduce the pg_usleep()
call, but borrow commit 57dcc2ef's approach: poll for postmaster death
from time to time.

XXX Can we make WaitEventSetHasMicrosecondResolution() into a
compile-time constant?
---
 src/backend/commands/vacuum.c   | 31 ++++++++++++++++++++++++++-----
 src/backend/storage/ipc/latch.c | 15 +++++++++++++++
 src/include/storage/latch.h     |  1 +
 3 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index f379e60dca..98fc554152 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -50,6 +50,7 @@
 #include "postmaster/bgworker_internals.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/acl.h"
@@ -2232,11 +2233,31 @@ vacuum_delay_point(void)
 		if (msec > VacuumCostDelay * 4)
 			msec = VacuumCostDelay * 4;
 
-		(void) WaitLatchUs(MyLatch,
-						   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-						   msec * 1000,
-						   WAIT_EVENT_VACUUM_DELAY);
-		ResetLatch(MyLatch);
+		if (!WaitEventSetHasMicrosecondResolution())
+		{
+			static uint32 postmaster_poll_count = 0;
+
+			/*
+			 * We need to use a plain old sleep to get microsecond resolution
+			 * on this platform, but we don't want to ignore postmaster death
+			 * during very long vacuums.  Checking involves a system call on
+			 * some systems, so check less frequently.
+			 */
+			if (postmaster_poll_count++ % 1024 == 0 && !PostmasterIsAlive())
+				exit(1);
+
+			pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+			pg_usleep(msec * 1000);
+			pgstat_report_wait_end();
+		}
+		else
+		{
+			(void) WaitLatchUs(MyLatch,
+							   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							   msec * 1000,
+							   WAIT_EVENT_VACUUM_DELAY);
+			ResetLatch(MyLatch);
+		}
 
 		VacuumCostBalance = 0;
 
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 5c138ab9bf..5b590be9bf 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -2221,6 +2221,21 @@ WaitEventSetCanReportClosed(void)
 #endif
 }
 
+/*
+ * Return whether the current build options can ask the OS to wait with
+ * microsecond resolution.
+ */
+bool
+WaitEventSetHasMicrosecondResolution(void)
+{
+#if	defined(WAIT_USE_KQUEUE) || \
+	(defined(WAIT_USE_EPOLL) && defined(HAVE_EPOLL_PWAIT2))
+	return true;
+#else
+	return false;
+#endif
+}
+
 /*
  * Get the number of wait events registered in a given WaitEventSet.
  */
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 756c3114ed..27dc2dec3d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -197,5 +197,6 @@ extern int	WaitLatchOrSocketUs(Latch *latch, int wakeEvents,
 extern void InitializeLatchWaitSet(void);
 extern int	GetNumRegisteredWaitEvents(WaitEventSet *set);
 extern bool WaitEventSetCanReportClosed(void);
+extern bool WaitEventSetHasMicrosecondResolution(void);
 
 #endif							/* LATCH_H */
-- 
2.39.2

#19

Thomas Munro

thomas.munro@gmail.com

almost 3 years ago

In reply to: Thomas Munro (#18)

1 attachment(s)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Fri, Mar 10, 2023 at 6:58 PM Thomas Munro <thomas.munro@gmail.com> wrote:

... Perhaps something like 0004, which also shows the sort
of thing that we might consider back-patching to 14 and 15 (next
revision I'll move that up the front and put it in back-patchable
form).

I think this is the minimal back-patchable change. I propose to go
ahead and do that, and then to kick the ideas about latch API changes
into a new thread for the next commitfest.

Attachments:

0001-Fix-fractional-vacuum_cost_delay.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-fractional-vacuum_cost_delay.patchDownload

From 1758cf16d22af7c9bae56bc6a40c394ba0107207 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 11 Mar 2023 10:42:20 +1300
Subject: [PATCH] Fix fractional vacuum_cost_delay.

Commit 4753ef37 changed vacuum_delay_point() to use the WaitLatch() API,
to fix the problem that vacuum could keep running for a very long time
after the postmaster died.

Unfortunately, that broke commit caf626b2's support for fractional
vacuum_cost_delay, which shipped in PostgreSQL 12.  WaitLatch() works in
whole milliseconds.

For now, revert the chance from commit 4753ef37, but add an explicit
check for postmaster death.  That's an extra system call on systems
other than Linux and FreeBSD, but that overhead doesn't matter much
considering that we willingly went to sleep and woke up again.  (In
later work, we might add higher resolution timeouts to the latch API so
that we could do this in the standard programming pattern.)

Back-patch to 14, where commit 4753ef37 arrived.

Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_b-q0hXCBUCAATh0Z4Zi6UkiC0k2DFgoD3nC-r3SkR3tg%40mail.gmail.com
---
 src/backend/commands/vacuum.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 2e12baf8eb..c54360a6a0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -50,6 +50,7 @@
 #include "postmaster/bgworker_internals.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/acl.h"
@@ -2232,11 +2233,18 @@ vacuum_delay_point(void)
 		if (msec > VacuumCostDelay * 4)
 			msec = VacuumCostDelay * 4;

-		(void) WaitLatch(MyLatch,
-						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-						 msec,
-						 WAIT_EVENT_VACUUM_DELAY);
-		ResetLatch(MyLatch);
+		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+		pg_usleep(msec * 1000);
+		pgstat_report_wait_end();
+
+		/*
+		 * We don't want to ignore postmaster death during very long vacuums
+		 * with vacuum_cost_delay configured.  We can't use the usual
+		 * WaitLatch() approach here because we want microsecond-based sleep
+		 * durations above.
+		 */
+		if (IsUnderPostmaster && !PostmasterIsAlive())
+			exit(1);

 		VacuumCostBalance = 0;

-- 
2.39.2

#20

Tom Lane

tgl@sss.pgh.pa.us

almost 3 years ago

In reply to: Thomas Munro (#19)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

Thomas Munro <thomas.munro@gmail.com> writes:

I think this is the minimal back-patchable change. I propose to go
ahead and do that, and then to kick the ideas about latch API changes
into a new thread for the next commitfest.

OK by me, but then again 4753ef37 wasn't my patch.

regards, tom lane

#21

Thomas Munro

thomas.munro@gmail.com

almost 3 years ago

In reply to: Thomas Munro (#12)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Fri, Mar 10, 2023 at 1:05 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Mar 10, 2023 at 11:37 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

On Thu, Mar 09, 2023 at 05:27:08PM -0500, Tom Lane wrote:

Is it reasonable to assume that all modern platforms can time
millisecond delays accurately? Ten years ago I'd have suggested
truncating the delay to a multiple of 10msec and using this logic
to track the remainder, but maybe now that's unnecessary.

If so, it might also be worth updating or removing this comment in
pgsleep.c:

* NOTE: although the delay is specified in microseconds, the effective
* resolution is only 1/HZ, or 10 milliseconds, on most Unixen. Expect
* the requested delay to be rounded up to the next resolution boundary.

I've had doubts for some time about whether this is still accurate...

Unfortunately I was triggered by this Unix archeology discussion, and
wasted some time this weekend testing every Unix we target. I found 3
groups:

1. OpenBSD, NetBSD: Like the comment says, kernel ticks still control
sleep resolution. I measure an average time of ~20ms when I ask for
1ms sleeps in a loop with select() or nanosleep(). I don't actually
understand why it's not ~10ms because HZ is 100 on these systems, but
I didn't look harder.

2. AIX, Solaris, illumos: select() can sleep for 1ms accurately, but
not fractions of 1ms. If you use nanosleep() instead of select(),
then AIX joins the third group (huh, maybe it's just that its
select(us) calls poll(ms) under the covers?), but Solaris does not
(maybe it's still tick-based, but HZ == 1000?).

3. Linux, FreeBSD, macOS: sub-ms sleeps are quite accurate (via
various system calls).

I didn't test Windows but it sounds a lot like it is in group 1 if you
use WaitForMultipleObjects() or SleepEx(), as we do.

You can probably tune some of the above; for example FreeBSD can go
back to the old way with kern.eventtimer.periodic=1 to get a thousand
interrupts per second (kern.hz) instead of programming a hardware
timer to get an interrupt at just the right time, and then 0.5ms sleep
requests get rounded to an average of 1ms, just like on Solaris. And
power usage probably goes up.

As for what do do about it, I dunno, how about this?

  * NOTE: although the delay is specified in microseconds, the effective
- * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
- * the requested delay to be rounded up to the next resolution boundary.
+ * resolution is only 1/HZ on systems that use periodic kernel ticks to limit
+ * sleeping.  This may cause sleeps to be rounded up by as much as 1-20
+ * milliseconds on old Unixen and Windows.

As for the following paragraph about the dangers of select() and
interrupts and restarts, I suspect it is describing the HP-UX
behaviour (a dropped platform), which I guess must have led to POSIX's
reluctance to standardise that properly, but in any case all
hypothetical concerns would disappear if we just used POSIX
[clock_]nanosleep(), no? It has defined behaviour on signals, and it
also tells you the remaining time (if we cared, which we wouldn't).

#22

Thomas Munro

thomas.munro@gmail.com

almost 3 years ago

In reply to: Tom Lane (#20)

3 attachment(s)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Sat, Mar 11, 2023 at 11:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

I think this is the minimal back-patchable change. I propose to go
ahead and do that, and then to kick the ideas about latch API changes
into a new thread for the next commitfest.

OK by me, but then again 4753ef37 wasn't my patch.

I'll wait another day to see if Stephen or anyone else who hasn't hit
Monday yet wants to object.

Here also are those other minor tweaks, for master only. I see now
that nanosleep() has already been proposed before:

/messages/by-id/CABQrizfxpBLZT5mZeE0js5oCh1tqEWvcGF3vMRCv5P-RwUY5dQ@mail.gmail.com
/messages/by-id/4902.1552349020@sss.pgh.pa.us

There I see the question of whether it should loop on EINTR to keep
waiting the remaining time. Generally it seems like a job for
something higher level to deal with interruption policy, and of course
all the race condition and portability problems inherent with signals
are fixed by using latches instead, so I don't think there really is a
good answer to that question -- if you loop, you break our programming
rules by wilfully ignoring eg global barriers, but if you don't loop,
it implies you're relying on the interrupt to cause you to do
something and yet you might have missed it if it was delivered just
before the syscall. At the time of the earlier thread, maybe it was
more acceptable as it could only delay cancel for that backend, but
now it might even delay arbitrary other backends, and neither answer
to that question can fix that in a race-free way. Also, back then
latches had a SIGUSR1 handler on common systems, but now they don't,
so (racy unreliable) latch responsiveness has decreased since then.
So I think we should just leave the interface as it is, and build
better things and then eventually retire it. This general topic is
also currently being discussed at:

/messages/by-id/20230209205929.GA720594@nathanxps13

I propose to go ahead and make this small improvement anyway because
it'll surely be a while before we delete the last pg_usleep() call,
and it's good to spring-clean old confusing commentary about signals
and portability.

Attachments:

0001-Fix-fractional-vacuum_cost_delay.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-fractional-vacuum_cost_delay.patchDownload

From 03c7bd1a8df9f3bd3b6b00ad86c8ca734b45f9bd Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 11 Mar 2023 10:42:20 +1300
Subject: [PATCH 1/3] Fix fractional vacuum_cost_delay.

Commit 4753ef37 changed vacuum_delay_point() to use the WaitLatch() API,
to fix the problem that vacuum could keep running for a very long time
after the postmaster died.

Unfortunately, that broke commit caf626b2's support for fractional
vacuum_cost_delay, which shipped in PostgreSQL 12.  WaitLatch() works in
whole milliseconds.

For now, revert the change from commit 4753ef37, but add an explicit
check for postmaster death.  That's an extra system call on systems
other than Linux and FreeBSD, but that overhead doesn't matter much
considering that we willingly went to sleep and woke up again.  (In
later work, we might add higher resolution timeouts to the latch API so
that we could do this in the standard programming pattern, but that
wouldn't be back-patched.)

Back-patch to 14, where commit 4753ef37 arrived.

Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_b-q0hXCBUCAATh0Z4Zi6UkiC0k2DFgoD3nC-r3SkR3tg%40mail.gmail.com
---
 src/backend/commands/vacuum.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 2e12baf8eb..c54360a6a0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -50,6 +50,7 @@
 #include "postmaster/bgworker_internals.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/acl.h"
@@ -2232,11 +2233,18 @@ vacuum_delay_point(void)
 		if (msec > VacuumCostDelay * 4)
 			msec = VacuumCostDelay * 4;

-		(void) WaitLatch(MyLatch,
-						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-						 msec,
-						 WAIT_EVENT_VACUUM_DELAY);
-		ResetLatch(MyLatch);
+		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
+		pg_usleep(msec * 1000);
+		pgstat_report_wait_end();
+
+		/*
+		 * We don't want to ignore postmaster death during very long vacuums
+		 * with vacuum_cost_delay configured.  We can't use the usual
+		 * WaitLatch() approach here because we want microsecond-based sleep
+		 * durations above.
+		 */
+		if (IsUnderPostmaster && !PostmasterIsAlive())
+			exit(1);

 		VacuumCostBalance = 0;

-- 
2.39.2

0002-Update-obsolete-comment-about-pg_usleep-accuracy.patchtext/x-patch; charset=US-ASCII; name=0002-Update-obsolete-comment-about-pg_usleep-accuracy.patchDownload

From f0c3b6506e040933d2466ee4061f29a9a946a155 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 13 Mar 2023 09:50:06 +1300
Subject: [PATCH 2/3] Update obsolete comment about pg_usleep() accuracy.

There are still some systems that use traditional tick or jiffy-based
sleep timing, but many including Linux (see "man 7 time"), FreeBSD and
macOS are limited only by the accuracy of the available timer hardware
and system load.  Update our comment about that, which previously said
that most Unix systems had that design.  Also mention that Windows is
like the older Unixen in this respect.

Reported-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BogAon8_V223Ldv6taPR2uKH3X_UJ_A7LJAf3-VRARPA%40mail.gmail.com
---
 src/port/pgsleep.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/port/pgsleep.c b/src/port/pgsleep.c
index 8a709cd01d..dcba7eda44 100644
--- a/src/port/pgsleep.c
+++ b/src/port/pgsleep.c
@@ -26,8 +26,9 @@
  * pg_usleep --- delay the specified number of microseconds.
  *
  * NOTE: although the delay is specified in microseconds, the effective
- * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
- * the requested delay to be rounded up to the next resolution boundary.
+ * resolution is only 1/HZ on systems that use periodic kernel ticks to wake
+ * up.  This may cause sleeps to be rounded up by 1-20 milliseconds on older
+ * Unixen and Windows.
  *
  * On machines where "long" is 32 bits, the maximum delay is ~2000 seconds.
  *
-- 
2.39.2

0003-Use-nanosleep-to-implement-pg_usleep.patchtext/x-patch; charset=US-ASCII; name=0003-Use-nanosleep-to-implement-pg_usleep.patchDownload

From 1f0bd35300dd0c669d4bbf1f50857f17ae64b08c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 13 Mar 2023 10:45:14 +1300
Subject: [PATCH 3/3] Use nanosleep() to implement pg_usleep().

The previous coding based on select() required a lot of commentary about
historical portability concerns.  We can just get rid of all that and
use the POSIX nanosleep() function.  The newer clock_nanosleep() variant
might be better, because it can take a CLOCK_MONOTONIC argument to avoid
the chance that ntpd adjustments affect sleep times, but macOS hasn't
caught up with that yet.
---
 src/port/pgsleep.c | 24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/src/port/pgsleep.c b/src/port/pgsleep.c
index dcba7eda44..3f42021596 100644
--- a/src/port/pgsleep.c
+++ b/src/port/pgsleep.c
@@ -12,9 +12,7 @@
  */
 #include "c.h"
 
-#include <unistd.h>
-#include <sys/select.h>
-#include <sys/time.h>
+#include <time.h>
 
 /*
  * In a Windows backend, we don't use this implementation, but rather
@@ -32,15 +30,11 @@
  *
  * On machines where "long" is 32 bits, the maximum delay is ~2000 seconds.
  *
- * CAUTION: the behavior when a signal arrives during the sleep is platform
- * dependent.  On most Unix-ish platforms, a signal does not terminate the
- * sleep; but on some, it will (the Windows implementation also allows signals
- * to terminate pg_usleep).  And there are platforms where not only does a
- * signal not terminate the sleep, but it actually resets the timeout counter
- * so that the sleep effectively starts over!  It is therefore rather hazardous
- * to use this for long sleeps; a continuing stream of signal events could
- * prevent the sleep from ever terminating.  Better practice for long sleeps
- * is to use WaitLatch() with a timeout.
+ * CAUTION: if interrupted by a signal, this function will return, but its
+ * interface doesn't report that.  It's not a good idea to use this
+ * for long sleeps in the backend, because backends are expected to respond to
+ * interrupts promptly.  Better practice for long sleeps is to use WaitLatch()
+ * with a timeout.
  */
 void
 pg_usleep(long microsec)
@@ -48,11 +42,11 @@ pg_usleep(long microsec)
 	if (microsec > 0)
 	{
 #ifndef WIN32
-		struct timeval delay;
+		struct timespec delay;
 
 		delay.tv_sec = microsec / 1000000L;
-		delay.tv_usec = microsec % 1000000L;
-		(void) select(0, NULL, NULL, NULL, &delay);
+		delay.tv_nsec = (microsec % 1000000L) * 1000;
+		(void) nanosleep(&delay, NULL);
 #else
 		SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE);
 #endif
-- 
2.39.2

#23

Nathan Bossart

nathandbossart@gmail.com

almost 3 years ago

In reply to: Thomas Munro (#22)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

* NOTE: although the delay is specified in microseconds, the effective
- * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
- * the requested delay to be rounded up to the next resolution boundary.
+ * resolution is only 1/HZ on systems that use periodic kernel ticks to wake
+ * up.  This may cause sleeps to be rounded up by 1-20 milliseconds on older
+ * Unixen and Windows.

nitpick: Could the 1/HZ versus 20 milliseconds discrepancy cause confusion?
Otherwise, I think this is the right idea.

+ * CAUTION: if interrupted by a signal, this function will return, but its
+ * interface doesn't report that.  It's not a good idea to use this
+ * for long sleeps in the backend, because backends are expected to respond to
+ * interrupts promptly.  Better practice for long sleeps is to use WaitLatch()
+ * with a timeout.

I'm not sure this argument follows. If pg_usleep() returns if interrupted,
then why are we concerned about delayed responses to interrupts?

-		delay.tv_usec = microsec % 1000000L;
-		(void) select(0, NULL, NULL, NULL, &delay);
+		delay.tv_nsec = (microsec % 1000000L) * 1000;
+		(void) nanosleep(&delay, NULL);

Using nanosleep() seems reasonable to me.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#24

Thomas Munro

thomas.munro@gmail.com

almost 3 years ago

In reply to: Nathan Bossart (#23)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Tue, Mar 14, 2023 at 12:10 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:

* NOTE: although the delay is specified in microseconds, the effective
- * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
- * the requested delay to be rounded up to the next resolution boundary.
+ * resolution is only 1/HZ on systems that use periodic kernel ticks to wake
+ * up.  This may cause sleeps to be rounded up by 1-20 milliseconds on older
+ * Unixen and Windows.

nitpick: Could the 1/HZ versus 20 milliseconds discrepancy cause confusion?
Otherwise, I think this is the right idea.

Better words welcome; 1-20ms summarises the range I actually measured,
and if reports are correct about Windows' HZ=64 (1/HZ = 15.625ms) then
it neatly covers that too, so I don't feel too bad about not chasing
down the reason for that 10ms/20ms discrepancy; maybe I looked at the
wrong HZ number (which you can change, anyway), I'm not too used to
NetBSD... BTW they have a project plan to fix that
https://wiki.netbsd.org/projects/project/tickless/

+ * CAUTION: if interrupted by a signal, this function will return, but its
+ * interface doesn't report that.  It's not a good idea to use this
+ * for long sleeps in the backend, because backends are expected to respond to
+ * interrupts promptly.  Better practice for long sleeps is to use WaitLatch()
+ * with a timeout.
I'm not sure this argument follows. If pg_usleep() returns if interrupted,
then why are we concerned about delayed responses to interrupts?

Because you can't rely on it:

1. Maybe the signal is delivered just before pg_usleep() begins, and
a handler sets some flag we would like to react to. Now pg_usleep()
will not be interrupted. That problem is solved by using latches
instead.
2. Maybe the signal is one that is no longer handled by a handler at
all; these days, latches use SIGURG, which pops out when you read a
signalfd or kqueue, so pg_usleep() will not wake up. That problem is
solved by using latches instead.

(The word "interrupt" is a bit overloaded, which doesn't help with
this discussion.)

-             delay.tv_usec = microsec % 1000000L;
-             (void) select(0, NULL, NULL, NULL, &delay);
+             delay.tv_nsec = (microsec % 1000000L) * 1000;
+             (void) nanosleep(&delay, NULL);

Using nanosleep() seems reasonable to me.

Thanks for looking!

#25

Nathan Bossart

nathandbossart@gmail.com

almost 3 years ago

In reply to: Thomas Munro (#24)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Tue, Mar 14, 2023 at 03:38:45PM +1300, Thomas Munro wrote:

On Tue, Mar 14, 2023 at 12:10 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
* NOTE: although the delay is specified in microseconds, the effective
- * resolution is only 1/HZ, or 10 milliseconds, on most Unixen.  Expect
- * the requested delay to be rounded up to the next resolution boundary.
+ * resolution is only 1/HZ on systems that use periodic kernel ticks to wake
+ * up.  This may cause sleeps to be rounded up by 1-20 milliseconds on older
+ * Unixen and Windows.
nitpick: Could the 1/HZ versus 20 milliseconds discrepancy cause confusion?
Otherwise, I think this is the right idea.
Better words welcome; 1-20ms summarises the range I actually measured,
and if reports are correct about Windows' HZ=64 (1/HZ = 15.625ms) then
it neatly covers that too, so I don't feel too bad about not chasing
down the reason for that 10ms/20ms discrepancy; maybe I looked at the
wrong HZ number (which you can change, anyway), I'm not too used to
NetBSD... BTW they have a project plan to fix that
https://wiki.netbsd.org/projects/project/tickless/

Here is roughly what I had in mind:

NOTE: Although the delay is specified in microseconds, older Unixen and
Windows use periodic kernel ticks to wake up, which might increase the
delay time significantly. We've observed delay increases as large as
20 milliseconds on supported platforms.

+ * CAUTION: if interrupted by a signal, this function will return, but its
+ * interface doesn't report that.  It's not a good idea to use this
+ * for long sleeps in the backend, because backends are expected to respond to
+ * interrupts promptly.  Better practice for long sleeps is to use WaitLatch()
+ * with a timeout.
I'm not sure this argument follows. If pg_usleep() returns if interrupted,
then why are we concerned about delayed responses to interrupts?
Because you can't rely on it:

1. Maybe the signal is delivered just before pg_usleep() begins, and
a handler sets some flag we would like to react to. Now pg_usleep()
will not be interrupted. That problem is solved by using latches
instead.
2. Maybe the signal is one that is no longer handled by a handler at
all; these days, latches use SIGURG, which pops out when you read a
signalfd or kqueue, so pg_usleep() will not wake up. That problem is
solved by using latches instead.

(The word "interrupt" is a bit overloaded, which doesn't help with
this discussion.)

Yeah, I think it would be clearer if "interrupt" was disambiguated.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#26

Thomas Munro

thomas.munro@gmail.com

almost 3 years ago

In reply to: Nathan Bossart (#25)

Re: Sub-millisecond [autovacuum_]vacuum_cost_delay broken

On Wed, Mar 15, 2023 at 7:54 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

Here is roughly what I had in mind:

NOTE: Although the delay is specified in microseconds, older Unixen and
Windows use periodic kernel ticks to wake up, which might increase the
delay time significantly. We've observed delay increases as large as
20 milliseconds on supported platforms.

Sold. And pushed.

I couldn't let that 20ms != 1s/100 problem go, despite my claim that I
would, and now I see: NetBSD does have 10ms resolution, so everyone
can relax, arithmetic still works. It's just that it always or often
adds on one extra tick, for some strange reason. So you can measure
20ms, 30ms, ... but never as low as 10ms. *Shrug*. Your description
covered that nicely.

https://marc.info/?l=netbsd-current-users&m=144832117108168&w=2

(The word "interrupt" is a bit overloaded, which doesn't help with
this discussion.)

Yeah, I think it would be clearer if "interrupt" was disambiguated.

OK, I rewrote it to avoid that terminology.

On small detail, after reading Tom's 2019 proposal to do this[1]/messages/by-id/4902.1552349020@sss.pgh.pa.us: He
mentioned SUSv2's ENOSYS error. I see that SUSv3 (POSIX.1-2001)
dropped that. Systems that don't have the "timers" option simply
shouldn't define the function, but we already require the "timers"
option for clock_gettime(). And more practically, I know that all our
target systems have it and it works.

Pushed.

[1]: /messages/by-id/4902.1552349020@sss.pgh.pa.us