Reducing power consumption on idle servers
Some years ago we did a pass through the various worker processes to
add hibernation as a mechanism to reduce power consumption on an idle
server. Replication never got the memo, so power reduction on an
idle server is not very effective on standbys or logical subscribers.
The code and timing for hibernation is also different for each worker,
which is confusing.
The proposal is to improve the situation and reduce power consumption on
an idle server with a series of fairly light changes that give a positive
green/environmental benefit.
CURRENT STATE
These servers naturally sleep for long periods when inactive:
* postmaster - 60s
* checkpointer - checkpoint_timeout
* syslogger - log_rotation_age
* pgarch - 60s
* autovac launcher - autovacuum_naptime
* walsender - wal_sender_timeout /2
* contrib/prewarm - checkpoint_timeout or shutdown
* pgstat - except on Windows, see later
The following servers all have some kind of hibernation mode:
* bgwriter - 50 * bgwriter_delay
* walwriter - 25 * wal_writer_delay
These servers don't try to hibernate at all:
* logical worker - 1s
* logical launcher - wal_retrieve_retry_interval (undocumented)
* startup - hardcoded 5s when streaming, wal_retrieve_retry_interval
for WAL files
* wal_receiver - 100ms, currently gets woken when WAL arrives
* pgstat - 2s (on Windows only)
PROPOSED CHANGES
1. Standardize the hibernation time at 60s, using a #define
HIBERNATE_DELAY_SEC 60
2. Refactor postmaster and pgarch to use that setting, rather than hardcoded 60
3. Standardize the hibernation design pattern through a set of macros
in latch.h,
based on the coding of walwriter.c, since that was the best
example of hibernation code.
Hibernation is independent for each process.
This is explained in the header for latch.h, with an example in
src/test/modules/worker_spi/worker_spi.c
The intention here is to provide simple facilities to allow authors
of bg workers to add hibernation code also.
Summary: after 50 idle loops, hibernate for 60s each time through the loop
In all cases, switch immediately back into action when needed.
4. Change these processes to hibernate using the standard design pattern
* bgwriter - currently gets woken when a user allocates a buffer, no
change proposed
* walwriter - currently gets woken by XLogFlush, no change proposed
5. The startup process has a hardcoded 5s loop because it checks for a
trigger file to promote. Hibernating would mean that it would promote
more slowly, and/or restart a failing walreceiver more slowly, so this
requires user approval; hence, add new GUCs to approve that choice.
This is a valid choice because a long-term idle server is obviously not
in current use, so waiting 60s for failover or restart is very unlikely
to cause a significant issue. The startup process is woken by the
walreceiver when WAL arrives, so if all is well, it will swing back
into action quickly when needed.
If standby_can_hibernate = true, then these processes can hibernate:
* startup
* walreceiver - hibernate, but only for wal_receiver_timeout/2, not
60s, so other existing behavior is preserved.
If subscription_can_hibernate = true, then these processes can hibernate:
* logical launcher
* logical worker - hibernate, but only for wal_receiver_timeout/2, not
60s, so other existing behavior is preserved.
Patch to do all of the above attached.
Comments please.
--
Simon Riggs http://www.EnterpriseDB.com/
Attachment: hibernate_to_reduce_power_consumption.v1.patch (+225 -63)
Hi,
On 2022-02-19 14:10:39 +0000, Simon Riggs wrote:
Some years ago we did a pass through the various worker processes to
add hibernation as a mechanism to reduce power consumption on an idle
server. Replication never got the memo, so power consumption on an
idle server is not very effective on standby or logical subscribers.
The code and timing for hibernation is also different for each worker,
which is confusing.
Yea, I think it's high time that we fix this.
It's not even just power consumption:
- the short timeouts hide all kinds of bugs around missed wakeups / racy
wakeup sequences
- debugging problems gets harder if there's lots of frequent activity
CURRENT STATE
These servers naturally sleep for long periods when inactive:
* postmaster - 60s
* checkpointer - checkpoint_timeout
* syslogger - log_rotation_age
* pgarch - 60s
Why do we need *any* timeout here? It seems one of those bug/race hiding
things? Imo we should only use a timeout when a prior archive failed and rely
on wakeups otherwise.
* autovac launcher - autovacuum_naptime
On production systems autovacuum_naptime often can't be a large value,
otherwise it's easy to not keep up on small busy tables. That's fine for
actually busy servers, but with the increase in hosted PG offerings, the
defaults in those offerings need to cater to a broader audience.
These servers don't try to hibernate at all:
* logical worker - 1s
Not great.
* logical launcher - wal_retrieve_retry_interval (undocumented)
I think it's actually 180s in the happy path. The wal_retrieve_retry_interval
is about how often workers get restarted. But if there's no need to restart,
it sleeps longer.
* startup - hardcoded 5s when streaming, wal_retrieve_retry_interval
for WAL files
* wal_receiver - 100ms, currently gets woken when WAL arrives
This is a fairly insane one. We should compute a precise timeout based on
wal_receiver_timeout.
And it's not just one syscall every 100ms, it's
recvfrom(4, 0x7fd66134b960, 16384, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_create1(EPOLL_CLOEXEC) = 6
epoll_ctl(6, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1630593560, u64=140558730322456}}) = 0
epoll_ctl(6, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1630593584, u64=140558730322480}}) = 0
epoll_ctl(6, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1630593608, u64=140558730322504}}) = 0
epoll_wait(6, [], 1, 100) = 0
close(6) = 0
PROPOSED CHANGES
1. Standardize the hibernation time at 60s, using a #define
HIBERNATE_DELAY_SEC 60
I don't think the hibernation stuff is a great pattern. When hibernation was
introduced we neither had latches nor condition variables, so we couldn't
easily do better. Today we have, so we should do better.
We should either not use timeouts at all (e.g. pgarch without a preceding
failure) and rely on being woken up when new work arrives.
Or use precisely calculated timeouts (e.g. NOW() - last_recv_timestamp +
wal_receiver_timeout) when there's a good reason to wake up (like needing to
send a status message).
IMO we should actually consider moving *away* from hibernation for the cases
where we currently use it and rely much more heavily on being notified when
there's work to do, without a timeout when not.
5. Startup process has a hardcoded 5s loop because it checks for
trigger file to promote it. So hibernating would mean that it would
promote more slowly, and/or restart failing walreceiver more slowly,
so this requires user approval, and hence add new GUCs to approve that
choice. This is a valid choice because a long-term idle server is
obviously not in current use, so waiting 60s for failover or restart
is very unlikely to cause significant issue.
There's plenty of databases that are close to read-only but business critical,
so I don't buy that argument.
IMO we should instead consider either deprecating file based promotion, or
adding an optional dependency on filesystem monitoring APIs (i.e. inotify etc)
that avoid the need to poll for file creation.
Greetings,
Andres Freund
On Sun, Feb 20, 2022 at 6:03 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-02-19 14:10:39 +0000, Simon Riggs wrote:
* wal_receiver - 100ms, currently gets woken when WAL arrives
This is a fairly insane one. We should compute a precise timeout based on
wal_receiver_timeout.
I proposed a patch to do that here: https://commitfest.postgresql.org/37/3520/
It needs a couple more tweaks (I realised it needs to round
microsecond sleep times up to the nearest whole millisecond,
explaining a few spurious wakeups, and Horiguchi-san had some more
feedback for me that I haven't got to yet), but it seems close.
And it's not just one syscall every 100ms, it's
recvfrom(4, 0x7fd66134b960, 16384, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
epoll_create1(EPOLL_CLOEXEC) = 6
epoll_ctl(6, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1630593560, u64=140558730322456}}) = 0
epoll_ctl(6, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1630593584, u64=140558730322480}}) = 0
epoll_ctl(6, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=1630593608, u64=140558730322504}}) = 0
epoll_wait(6, [], 1, 100) = 0
close(6) = 0
Yeah, I have a patch for that (no CF entry yet, will create shortly),
and together these two patches get walreceiver's wait loop down to a
single epoll_wait() call (or local equivalent) that waits the exact
amount of time required to perform the next periodic action.
5. Startup process has a hardcoded 5s loop because it checks for
trigger file to promote it. So hibernating would mean that it would
promote more slowly, and/or restart failing walreceiver more slowly,
so this requires user approval, and hence add new GUCs to approve that
choice. This is a valid choice because a long-term idle server is
obviously not in current use, so waiting 60s for failover or restart
is very unlikely to cause significant issue.
There's plenty of databases that are close to read-only but business critical,
so I don't buy that argument.
IMO we should instead consider either deprecating file based promotion, or
adding an optional dependency on filesystem monitoring APIs (i.e. inotify etc)
that avoid the need to poll for file creation.
Yeah, I pondered inotify/KQ_FILTER_VNODE/FindFirstChangeNotification
for this while working on 600f2f50. I realised I could probably teach
a WaitEventSet to wake up when a file is added on most OSes, but I
didn't try to prototype it... I agree that the whole file
system-watching concept feels pretty clunky, so if just getting rid of
it is an option...
It would be nice to tame the walwriter's wakeups...
On Sat, 19 Feb 2022 at 17:03, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-02-19 14:10:39 +0000, Simon Riggs wrote:
Some years ago we did a pass through the various worker processes to
add hibernation as a mechanism to reduce power consumption on an idle
server. Replication never got the memo, so power consumption on an
idle server is not very effective on standby or logical subscribers.
The code and timing for hibernation is also different for each worker,
which is confusing.
Yea, I think it's high time that we fix this.
Good to have your support. There are millions of servers running idle
for some part of their duty cycle.
This patch seeks to change the situation for the better in PG15, i.e.
soon, so the changes proposed are deliberately light. It also seeks to
provide a framework that writers of background worker processes can
follow, since we can't just fix core, we need to fix all the various
bgworkers in use as well.
IMO we should instead consider either deprecating file based promotion, or
adding an optional dependency on filesystem monitoring APIs (i.e. inotify etc)
that avoid the need to poll for file creation.
Deprecating explicit file-based promotion is possible and simple, so
that is the approach in the latest version of the patch.
Thanks for the suggestion.
I've re-analyzed all of the code around startup/walreceiver
interactions and there isn't any need for the 5s delay in the startup
process, IMHO, so it can sleep longer.
IMO we should actually consider moving *away* from hibernation for the cases
where we currently use it and rely much more heavily on being notified when
there's work to do, without a timeout when not.
I don't think that is a practical approach this close to end of PG15.
This would require changing behavior of bgwriter, walwriter, pgarch
when they are not broken. The likelihood that we would break something
is too high.
What is an issue is that the sleep times of various procs are on
completely different cycles, which is why I am proposing normalizing
them so that Postgres can actually sleep effectively.
* autovac launcher - autovacuum_naptime
On production systems autovacuum_naptime often can't be a large value,
otherwise it's easy to not keep up on small busy tables. That's fine for
actually busy servers, but with the increase in hosted PG offerings, the
defaults in those offerings needs to cater to a broader audience.
Autovac varies its wakeup cycle according to how much work is done. It
is OK to set autovacuum_naptime without affecting power consumption
when idle.
Idle for autovac is defined slightly differently, since if all user
work completes then there may still be a lot of vacuuming to do before
it goes fully idle. But my observation is that there are many servers
that go idle for more than 50% of each week, when operating 8-12 hours
per day, 5 days per week, so we can still save a lot of power.
This patch doesn't change how autovac works, it just uses a common
setting for the hibernation that eventually occurs.
These servers don't try to hibernate at all:
* logical worker - 1s
Not great.
Agreed, the patch improves this, roughly same as walreceiver.
* logical launcher - wal_retrieve_retry_interval (undocumented)
I think it's actually 180s in the happy path. The wal_retrieve_retry_interval
is about how often workers get restarted. But if there's no need to restart,
it sleeps longer.
I propose normalizing all of the various hibernation times to the same value.
* wal_receiver - 100ms, currently gets woken when WAL arrives
This is a fairly insane one. We should compute a precise timeout based on
wal_receiver_timeout.
That is exactly what the patch does, when it hibernates.
I wasn't aware of Thomas' work, but now that I am we can choose which
of those approaches to use for WAL receiver. I hope that we can fix
logical worker and wal receiver to use the same algorithm. The rest of
this patch would still be valid, whatever we do for those two procs.
--
Simon Riggs http://www.EnterpriseDB.com/
Attachment: hibernate_to_reduce_power_consumption.v2.patch (+128 -75)
Hi,
On 02/21/22 11:11, Simon Riggs wrote:
This patch seeks to change the situation for the better in PG15, i.e.
soon, so the changes proposed are deliberately light. It also seeks to
provide a framework that writers of background worker processes can
follow, since we can't just fix core, we need to fix all the various
bgworkers in use as well.
I think there might be a typo in the worker_spi.c example:
+ /*
+ * Use the standard design pattern for wait time/hibernation.
+ * After 50 consecutive loops with work_done=true the wait time
+ * will be set to the standard hibernation timeout of 60s.
+ */
+ SET_DELAY_OR_HIBERNATE(work_done, worker_spi_naptime * 1000L);
Shouldn't the comment be "with work_done=false" ?
Regards,
-Chap
On Mon, 21 Feb 2022 at 16:49, Chapman Flack <chap@anastigmatix.net> wrote:
Shouldn't the comment be "with work_done=false" ?
Good catch, thanks.
I've also added docs to say that "promote_trigger_file" is now
deprecated. There were no tests for that functionality, so just as
well it is being removed.
v3 attached.
--
Simon Riggs http://www.EnterpriseDB.com/
Attachment: hibernate_to_reduce_power_consumption.v3.patch (+135 -87)
On Mon, Feb 21, 2022 at 5:11 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
On Sat, 19 Feb 2022 at 17:03, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-02-19 14:10:39 +0000, Simon Riggs wrote:
IMO we should instead consider either deprecating file based promotion, or
adding an optional dependency on filesystem monitoring APIs (i.e. inotify etc)
that avoid the need to poll for file creation.
Came here to suggest exactly that :)
Deprecating explicit file-based promotion is possible and simple, so
that is the approach in the latest version of the patch.
Is there any actual use-case for this other than backwards
compatibility? The alternative has certainly been around for some time
now, so if we don't know a specific use-case for the file-based one
it's definitely time to deprecate it properly.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes:
Deprecating explicit file-based promotion is possible and simple, so
that is the approach in the latest version of the patch.
Is there any actual use-case for this other than backwards
compatibility?
The fundamental problem with signal-based promotion is that it's
an edge-triggered rather than level-triggered condition, ie if you
miss the single notification event you're screwed. I'm not sure
why we'd want to go there, considering all the problems we're having
with edge-triggered APIs in nearby threads.
We could combine the two approaches, perhaps: have "pg_ctl promote"
create a trigger file and then send a signal. The trigger file
would record the existence of the promotion request, so that if the
postmaster isn't running right now or misses the signal, it'd still
promote when restarted or otherwise at the next opportunity.
But with an approach like this, we could expect quick response to
the signal in normal conditions, without need for constant wakeups
to check for the file. Also, this route would allow overloading
of the signal, since it would just mean "wake up, I asked you to
do something" rather than needing to carry the full semantics of
what is to be done.
regards, tom lane
On 2/21/22 10:11 AM, Simon Riggs wrote:
* autovac launcher - autovacuum_naptime
On production systems autovacuum_naptime often can't be a large value,
otherwise it's easy to not keep up on small busy tables. That's fine for
actually busy servers, but with the increase in hosted PG offerings, the
defaults in those offerings needs to cater to a broader audience.
Autovac varies its wakeup cycle according to how much work is done. It
is OK to set autovacuum_naptime without affecting power consumption
when idle.
I'm wondering how many people understand that. I've seen a number of
servers running very low values of autovacuum_naptime in order to make
things more responsive.
Idle for autovac is defined slightly differently, since if all user
work completes then there may still be a lot of vacuuming to do before
it goes fully idle. But my observation is that there are many servers
that go idle for more than 50% of each week, when operating 8-12 hours
per day, 5 days per week, so we can still save a lot of power.
This patch doesn't change how autovac works, it just uses a common
setting for the hibernation that eventually occurs.
I'm wondering if it'd be worth linking autovac wakeup from a truly idle
state to the stats collector. If there's no stats messages coming in
clearly there's nothing new for autovac.
Jim Nasby <nasbyj@amazon.com> writes:
I'm wondering if it'd be worth linking autovac wakeup from a truly idle
state to the stats collector. If there's no stats messages coming in
clearly there's nothing new for autovac.
That seems pretty scary in the current system design, where the
stats collector is intentionally not 100% reliable (and sometimes,
less intentionally, it fails completely). If we get to a place
where we're willing to bank on stats being collected 100% of the
time, it might make sense.
regards, tom lane
Hi,
Replication never got the memo, so power consumption on an
idle server is not very effective on standby or logical subscribers.
The code and timing for hibernation is also different for each worker,
which is confusing.
Agree, this patch makes it easier to understand the hibernation
behavior of various workers.
1. Standardize the hibernation time at 60s, using a #define
HIBERNATE_DELAY_SEC 60
I notice in patch 3 HIBERNATE_DELAY_SEC has been increased to 300
seconds, what’s the reasoning behind it? Is longer hibernation delay
better? If so can we set it to INT_MAX (the max timeout allowed by
WaitLatch()) in which case a worker in hibernation only relies on
wakeup? I think it would be nice to run experiments to verify that the
patch reduces power consumption while varying the value of
HIBERNATE_DELAY_SEC.
Regards,
Zheng Li
Amazon RDS/Aurora for PostgreSQL
On Thu, Mar 3, 2022 at 5:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jim Nasby <nasbyj@amazon.com> writes:
I'm wondering if it'd be worth linking autovac wakeup from a truly idle
state to the stats collector. If there's no stats messages coming in
clearly there's nothing new for autovac.
That seems pretty scary in the current system design, where the
stats collector is intentionally not 100% reliable (and sometimes,
less intentionally, it fails completely). If we get to a place
where we're willing to bank on stats being collected 100% of the
time, it might make sense.
regards, tom lane
On Sat, 26 Feb 2022 at 17:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Magnus Hagander <magnus@hagander.net> writes:
Deprecating explicit file-based promotion is possible and simple, so
that is the approach in the latest version of the patch.
Is there any actual use-case for this other than backwards
compatibility?
The fundamental problem with signal-based promotion is that it's
an edge-triggered rather than level-triggered condition, ie if you
miss the single notification event you're screwed. I'm not sure
why we'd want to go there, considering all the problems we're having
with edge-triggered APIs in nearby threads.
We could combine the two approaches, perhaps: have "pg_ctl promote"
create a trigger file and then send a signal. The trigger file
would record the existence of the promotion request, so that if the
postmaster isn't running right now or misses the signal, it'd still
promote when restarted or otherwise at the next opportunity.
But with an approach like this, we could expect quick response to
the signal in normal conditions, without need for constant wakeups
to check for the file. Also, this route would allow overloading
of the signal, since it would just mean "wake up, I asked you to
do something" rather than needing to carry the full semantics of
what is to be done.
The above is how this works now, luckily, but few comments explain why.
This current state evolved because I first added file-based promotion,
for the reasons you give, but it was slow, so the signal approach was
added later, with a new API.
What we are discussing deprecating is the ability to create the
trigger file directly (the original API). Once that is gone the
mechanism would be to request promotion via pg_ctl promote (the new
API), which then creates the trigger file and sends a signal. So in
both APIs the trigger file is still used.
In this patch, if you create the trigger file
directly/explicitly/yourself (the old API) then it will still trigger
when hibernating, but it will be slower. The new API is unaffected by
this change.
--
Simon Riggs http://www.EnterpriseDB.com/
On Wed, 9 Mar 2022 at 01:16, Zheng Li <zhengli10@gmail.com> wrote:
1. Standardize the hibernation time at 60s, using a #define
HIBERNATE_DELAY_SEC 60
I notice in patch 3 HIBERNATE_DELAY_SEC has been increased to 300
seconds, what’s the reasoning behind it? Is longer hibernation delay
better? If so can we set it to INT_MAX (the max timeout allowed by
WaitLatch()) in which case a worker in hibernation only relies on
wakeup? I think it would be nice to run experiments to verify that the
patch reduces power consumption while varying the value of
HIBERNATE_DELAY_SEC.
Setting it to INT_MAX would be the same as not allowing a timeout,
which changes a lot of current behavior and makes it less robust.
Waking once per minute is what we do in various cases, so 60sec is a
good choice.
In the case of logical rep launcher we currently use 300sec, so using
60s would decrease this.
I don't see much difference between power consumption with timeouts of
60s and 300s.
In the latest patch, I chose 300s. Does anyone have an opinion on the
value here?
--
Simon Riggs http://www.EnterpriseDB.com/
Hi,
On 2022-03-10 17:50:47 +0000, Simon Riggs wrote:
On Wed, 9 Mar 2022 at 01:16, Zheng Li <zhengli10@gmail.com> wrote:
1. Standardize the hibernation time at 60s, using a #define
HIBERNATE_DELAY_SEC 60
I notice in patch 3 HIBERNATE_DELAY_SEC has been increased to 300
seconds, what’s the reasoning behind it? Is longer hibernation delay
better? If so can we set it to INT_MAX (the max timeout allowed by
WaitLatch()) in which case a worker in hibernation only relies on
wakeup? I think it would be nice to run experiments to verify that the
patch reduces power consumption while varying the value of
HIBERNATE_DELAY_SEC.
Setting it to INT_MAX would be the same as not allowing a timeout,
which changes a lot of current behavior and makes it less robust.
Most of these timeouts are a bad idea and should not exist. We repeatedly have
had bugs where we were missing wakeups etc but those bugs were harder to
notice because of timeouts. I'm against entrenching this stuff further.
Greetings,
Andres Freund
On 2022-02-21 21:04:19 +0000, Simon Riggs wrote:
On Mon, 21 Feb 2022 at 16:49, Chapman Flack <chap@anastigmatix.net> wrote:
Shouldn't the comment be "with work_done=false" ?
Good catch, thanks.
I've also added docs to say that "promote_trigger_file" is now
deprecated. There were no tests for that functionality, so just as
well it is being removed.
Doesn't pass tests, and hasn't for weeks from what I can see: https://cirrus-ci.com/task/5925633648754688?logs=test_world#L1153
Marked as waiting-on-author.
- Andres
On Tue, 22 Mar 2022 at 00:54, Andres Freund <andres@anarazel.de> wrote:
On 2022-02-21 21:04:19 +0000, Simon Riggs wrote:
On Mon, 21 Feb 2022 at 16:49, Chapman Flack <chap@anastigmatix.net> wrote:
Shouldn't the comment be "with work_done=false" ?
Good catch, thanks.
I've also added docs to say that "promote_trigger_file" is now
deprecated. There were no tests for that functionality, so just as
well it is being removed.
Doesn't pass tests, and hasn't for weeks from what I can see: https://cirrus-ci.com/task/5925633648754688?logs=test_world#L1153
Marked as waiting-on-author.
Hmm, just the not-in-sample test again. Fixed by marking GUC_NOT_IN_SAMPLE.
--
Simon Riggs http://www.EnterpriseDB.com/
Attachment: hibernate_to_reduce_power_consumption.v4.patch (+137 -88)
At Thu, 10 Mar 2022 11:45:10 -0800, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2022-03-10 17:50:47 +0000, Simon Riggs wrote:
On Wed, 9 Mar 2022 at 01:16, Zheng Li <zhengli10@gmail.com> wrote:
1. Standardize the hibernation time at 60s, using a #define
HIBERNATE_DELAY_SEC 60
I notice in patch 3 HIBERNATE_DELAY_SEC has been increased to 300
seconds, what’s the reasoning behind it? Is longer hibernation delay
better? If so can we set it to INT_MAX (the max timeout allowed by
WaitLatch()) in which case a worker in hibernation only relies on
wakeup? I think it would be nice to run experiments to verify that the
patch reduces power consumption while varying the value of
HIBERNATE_DELAY_SEC.
Setting it to INT_MAX would be the same as not allowing a timeout,
which changes a lot of current behavior and makes it less robust.
Most of these timeouts are a bad idea and should not exist. We repeatedly have
had bugs where we were missing wakeups etc but those bugs were harder to
I basically agree with this.
notice because of timeouts. I'm against entrenching this stuff further.
For example, pgarch.c theoretically doesn't need to hibernate, since it
has an apparent trigger event and a terminal condition. What we need
to do about the archiver is to set a timeout only when it didn't reach
the latest finished segment in an iteration. (This might need
additional shared memory use, though.)
I'm not sure about bgwriter, walwriter and logical replication stuff...
About walreciver,
- if ((now - startTime) > WALRCV_STARTUP_TIMEOUT)
+ if ((now - startTime) > wal_receiver_timeout)
This is simply a misuse of the timeout. WALRCV_STARTUP_TIMEOUT is not
used for a sleep and is irrelevant to receiving data from the peer
walsender. wal_receiver_timeout's default is 60s but we don't assume
that walreceiver takes that long to start up. (I think Thomas is
working on the walreceiver timeout stuff.)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, 24 Mar 2022 at 07:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Most of these timeouts are a bad idea and should not exist. We repeatedly have
had bugs where we were missing wakeups etc but those bugs were harder to
I basically agree to this.
As a general point, maybe. But we have a lot of procs doing a lot of
polling and we are unlikely to change that anytime soon.
notice because of timeouts. I'm against entrenching this stuff further.
For example, pgarch.c theoretically doesn't need hibernate, since it
has an apparent trigger event and a terminal condition. What we need
to do about archiver is to set timeout only when it didn't reach the
latest finished segment at an iteration. (this might need additional
shared memory use, though..)
archiver is not the important thing here. If there is objection to
that section of code we can remove that.
I'm not sure about bgwriter, walwriter and logical replication stuff...
I am sure. bgwriter, walwriter and logical worker need action from us
to allow them to hibernate in a sensible way that allows powersaving.
I think Thomas is working on the walreceiver timeout stuff.
Yes, he is. I have no problem going with what he advocates, if others agree.
However, that fix does not solve the whole problem, since many other
procs also need fixing.
The proposals of this patch are the following, each of which can be
independently accepted/rejected:
1. fix the sleep pattern of bgwriter, walwriter and logical worker
(directly affects powersave)
2. deprecate promote_trigger_file, which then allows us to fix the
sleep for startup process (directly affects powersave)
3. treat hibernation in all procs the same, for simplicity, and to
make sure we don't forget one later
4. provide a design pattern for background worker extensions to
follow, so as to encourage powersaving
Please don't reject the whole patch because you disagree with part
(3), it is not that important.
Thanks
--
Simon Riggs http://www.EnterpriseDB.com/
On Thu, Mar 24, 2022 at 6:59 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
The proposals of this patch are the following, each of which can be
independently accepted/rejected:
1. fix the sleep pattern of bgwriter, walwriter and logical worker
(directly affects powersave)
2. deprecate promote_trigger_file, which then allows us to fix the
sleep for startup process (directly affects powersave)
3. treat hibernation in all procs the same, for simplicity, and to
make sure we don't forget one later
4. provide a design pattern for background worker extensions to
follow, so as to encourage powersaving
Unfortunately, the patch isn't split in a way that corresponds to this
division. I think I'm +1 on change #2 -- deprecating
promote_trigger_file seems like a good idea to me independently of any
power-saving considerations. But I'm not sure that I am on board with
any of the other changes. I do agree with the basic goal of trying to
reduce unnecessary wakeups, but I think the rest of the patch is
taking a bit of a one-size-fits-all approach where I think that we
might want to be more nuanced. I think there are a couple of different
kinds of cases here.
In some places, like DetermineSleepTime(), we're already willing to
sleep for pretty long periods of time, like a minute. I think in those
cases we could consider doing nothing, on the theory that one wakeup
per minute is already not very much. If we do want to do something, I
think it should be in the direction of convincing ourselves that we
don't need a timeout at all. Having a timeout is a bit like insurance:
it guarantees that if for some reason the event by which we expect to
be awoken doesn't actually wake us up even though something meaningful
has happened, we won't hang forever. But if we think a wakeup per
minute is meaningful and we don't think there's any possibility of a
missed wakeup, let's just wait forever. I don't think we'll avoid many
user complaints by recovering from a missed wakeup in just under an
hour.
In other places, like WalWriterMain, we're basically polling. One
question in these kinds of cases is whether we can avoid polling in
favor of having some other process wake us up if the event that we
care about happens. This is unlikely to be practical in all cases -
e.g. we can't wait for a promotion trigger file to show up unless we
want to use inotify or something like that. However, we may be able to
avoid polling in some instances. When we can't, then I think it makes
sense to increase the wait time when the system appears to be idle. In
that subset of cases - we're polling and we can't avoid polling - this
kind of approach seems fairly reasonable.
I do have some concerns about the idea that the right strategy in
general, or even in particular cases, is to multiply by 50. This patch
hasn't invented that problem; it's already there. My concern is that
multiplying a small number by 50 seems very different than multiplying
a large number by 50. If we normally wait for 100ms before checking
for something to happen, and we wait for 5s, it's probably not going
to be a huge issue, because 5s is still a short amount of time for
human beings. If we normally wait for a minute before checking for
something to happen and we wait for 50 minutes, that's much more
likely to make a human being unhappy. Therefore, it's very unclear to
me that those cases should be treated the same way.
There's a more technical issue with this strategy, too: if we multiply
both short and long timeouts by 50, I think we are going to get pretty
much the same result as if we only multiply the short timeouts by 50.
Why even bother reducing the frequency of wakeups from minutes to
hours if we're elsewhere reducing the frequency from seconds to
minutes? If process A is still waking up every minute, putting process
B in the deep freeze isn't going to do a whole lot in terms of letting
the system go into any kind of deeper sleep.
All in all I feel that encouraging developers to adopt a
multiply-by-50 rule in all cases seems too simplistic to me. It may be
better than what we're doing right now, but it doesn't really seem
like the right policy.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, 24 Mar 2022 at 15:39, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 24, 2022 at 6:59 AM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
>> The proposals of this patch are the following, each of which can be
>> independently accepted/rejected:
>> 1. fix the sleep pattern of bgwriter, walwriter and logical worker
>> (directly affects powersave)
>> 2. deprecate promote_trigger_file, which then allows us to fix the
>> sleep for startup process (directly affects powersave)
>> 3. treat hibernation in all procs the same, for simplicity, and to
>> make sure we don't forget one later
>> 4. provide a design pattern for background worker extensions to
>> follow, so as to encourage powersaving
> Unfortunately, the patch isn't split in a way that corresponds to this
> division. I think I'm +1 on change #2 -- deprecating
> promote_trigger_file seems like a good idea to me independently of any
> power-saving considerations. But I'm not sure that I am on board with
> any of the other changes.
OK, so change (2) is good. Separate patch attached.
> I do agree with the basic goal of trying to
> reduce unnecessary wakeups, but I think the rest of the patch is
> taking a bit of a one-size-fits-all approach where I think that we
> might want to be more nuanced. I think there are a couple of different
> kinds of cases here.
OK, so you disagree with (3) and probably (4). No problem.
What about (1)? That directly affects the powersave capability, but I
didn't see anything in your reply that addresses it specifically.
If we don't fix (1) as well, the changes for startup and walreceiver
will be ineffective for powersaving.
What changes will be acceptable for bgwriter, walwriter and logical worker?
--
Simon Riggs http://www.EnterpriseDB.com/