Should we remove a fallback promotion? take 2
Hi,
We discussed the $SUBJECT six years ago at the following thread.
/messages/by-id/CAHGQGwGYkF+CvpOMdxaO=+aNAzc1Oo9O4LqWo50MxpvFj+0VOw@mail.gmail.com
Seems our consensus at that discussion was to leave a fallback
promotion for a release or two for debugging purpose or as
an emergency method because fast promotion might have
some issues, and then to remove it later. Now, more than six years
have already passed since that discussion. Is there still
any reason to keep a fallback promotion? If nothing, I'd like to
drop it from v13.
Regards,
--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
On Thu, Mar 5, 2020 at 8:48 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
We discussed the $SUBJECT six years ago at the following thread.
/messages/by-id/CAHGQGwGYkF+CvpOMdxaO=+aNAzc1Oo9O4LqWo50MxpvFj+0VOw@mail.gmail.comSeems our consensus at that discussion was to leave a fallback
promotion for a release or two for debugging purpose or as
an emergency method because fast promotion might have
some issues, and then to remove it later. Now, more than six years
have already passed since that discussion. Is there still
any reason to keep a fallback promotion? If nothing, I'd like to
drop it from v13.
Seems reasonable, but it would be better if people proposed these
kinds of changes closer to the beginning of the release cycle rather
than in the crush at the end.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Mar 05, 2020 at 09:40:54AM -0500, Robert Haas wrote:
Seems reasonable, but it would be better if people proposed these
kinds of changes closer to the beginning of the release cycle rather
than in the crush at the end.
+1, to both points.
--
Michael
On 2020/03/06 10:40, Michael Paquier wrote:
On Thu, Mar 05, 2020 at 09:40:54AM -0500, Robert Haas wrote:
Seems reasonable, but it would be better if people proposed these
kinds of changes closer to the beginning of the release cycle rather
than in the crush at the end.+1, to both points.
Ok, I'm fine to do that in v14.
Regards,
--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
On 2020-Mar-06, Michael Paquier wrote:
On Thu, Mar 05, 2020 at 09:40:54AM -0500, Robert Haas wrote:
Seems reasonable, but it would be better if people proposed these
kinds of changes closer to the beginning of the release cycle rather
than in the crush at the end.+1, to both points.
Why? Are you saying that there's some actual risk of breaking
something? We're not even near beta or feature freeze yet.
I'm not seeing the reason for the "please propose this sooner in the
cycle" argument. It has already been proposed sooner -- seven years
sooner. We're not waiting for users to complain anymore; clearly nobody
cared.
I think dragging things forever serves no purpose.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2020-03-06 16:33:18 -0300, Alvaro Herrera wrote:
On 2020-Mar-06, Michael Paquier wrote:
On Thu, Mar 05, 2020 at 09:40:54AM -0500, Robert Haas wrote:
Seems reasonable, but it would be better if people proposed these
kinds of changes closer to the beginning of the release cycle rather
than in the crush at the end.+1, to both points.
Why? Are you saying that there's some actual risk of breaking
something? We're not even near beta or feature freeze yet.I'm not seeing the reason for the "please propose this sooner in the
cycle" argument. It has already been proposed sooner -- seven years
sooner. We're not waiting for users to complain anymore; clearly nobody
cared.
Yea. There are changes that are so invasive that it's useful to go very
early, but in this case I'm not seeing it?
+1 for removing non-fast promotions.
FWIW, I find "fallback promotion" a confusing description.
Btw, I'd really like to make the crash recovery environment more like
the replication environment. I.e. have checkpointer, bgwriter running,
and have an 'end-of-recovery' record instead of a checkpoint at the end.
Greetings,
Andres Freund
On 2020/03/10 6:56, Andres Freund wrote:
Hi,
On 2020-03-06 16:33:18 -0300, Alvaro Herrera wrote:
On 2020-Mar-06, Michael Paquier wrote:
On Thu, Mar 05, 2020 at 09:40:54AM -0500, Robert Haas wrote:
Seems reasonable, but it would be better if people proposed these
kinds of changes closer to the beginning of the release cycle rather
than in the crush at the end.+1, to both points.
Why? Are you saying that there's some actual risk of breaking
something? We're not even near beta or feature freeze yet.I'm not seeing the reason for the "please propose this sooner in the
cycle" argument. It has already been proposed sooner -- seven years
sooner. We're not waiting for users to complain anymore; clearly nobody
cared.Yea. There are changes that are so invasive that it's useful to go very
early, but in this case I'm not seeing it?+1 for removing non-fast promotions.
Patch attached. I will add this into the first CF for v14.
FWIW, I find "fallback promotion" a confusing description.
Yeah, so I changed the subject.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachments:
drop_non_fast_promotion_v1.patchtext/plain; charset=UTF-8; name=drop_non_fast_promotion_v1.patch; x-mac-creator=0; x-mac-type=0Download+15-41
On Mon, Apr 20, 2020 at 03:26:16PM +0900, Fujii Masao wrote:
Patch attached. I will add this into the first CF for v14.
Thanks!
- if (IsPromoteSignaled()) + /* + * In 9.1 and 9.2 the postmaster unlinked the promote file inside the + * signal handler. It now leaves the file in place and lets the + * Startup process do the unlink. + */ + if (IsPromoteSignaled() && stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) { - /* - * In 9.1 and 9.2 the postmaster unlinked the promote file inside the - * signal handler. It now leaves the file in place and lets the - * Startup process do the unlink. This allows Startup to know whether - * it should create a full checkpoint before starting up (fallback - * mode). Fast promotion takes precedence. - */ - if (stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) - { - unlink(PROMOTE_SIGNAL_FILE); - unlink(FALLBACK_PROMOTE_SIGNAL_FILE); - fast_promote = true; - } - else if (stat(FALLBACK_PROMOTE_SIGNAL_FILE, &stat_buf) == 0) - { - unlink(FALLBACK_PROMOTE_SIGNAL_FILE); - fast_promote = false; - } - ereport(LOG, (errmsg("received promote request"))); - + unlink(PROMOTE_SIGNAL_FILE);
On HEAD, this code means that it is possible to end recovery just by
sending SIGUSR2 to the startup process. With your patch, this code
now means that in order to finish recovery you need to send SIGUSR2 to
the startup process *and* to create the promote signal file. Is that
really what you want?
--
Michael
On 2020/04/21 10:59, Michael Paquier wrote:
On Mon, Apr 20, 2020 at 03:26:16PM +0900, Fujii Masao wrote:
Patch attached. I will add this into the first CF for v14.
Thanks!
- if (IsPromoteSignaled()) + /* + * In 9.1 and 9.2 the postmaster unlinked the promote file inside the + * signal handler. It now leaves the file in place and lets the + * Startup process do the unlink. + */ + if (IsPromoteSignaled() && stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) { - /* - * In 9.1 and 9.2 the postmaster unlinked the promote file inside the - * signal handler. It now leaves the file in place and lets the - * Startup process do the unlink. This allows Startup to know whether - * it should create a full checkpoint before starting up (fallback - * mode). Fast promotion takes precedence. - */ - if (stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) - { - unlink(PROMOTE_SIGNAL_FILE); - unlink(FALLBACK_PROMOTE_SIGNAL_FILE); - fast_promote = true; - } - else if (stat(FALLBACK_PROMOTE_SIGNAL_FILE, &stat_buf) == 0) - { - unlink(FALLBACK_PROMOTE_SIGNAL_FILE); - fast_promote = false; - } - ereport(LOG, (errmsg("received promote request"))); - + unlink(PROMOTE_SIGNAL_FILE);
Thanks for reviewing the patch!
On HEAD, this code means that it is possible to end recovery just by
sending SIGUSR2 to the startup process.
Yes, in this case, non-fast promotion is triggered.
With your patch, this code
now means that in order to finish recovery you need to send SIGUSR2 to
the startup process *and* to create the promote signal file.
Yes, but isn't this the same as the way to trigger fast promotion in HEAD?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Tue, Apr 21, 2020 at 02:27:20PM +0900, Fujii Masao wrote:
On 2020/04/21 10:59, Michael Paquier wrote:
With your patch, this code
now means that in order to finish recovery you need to send SIGUSR2 to
the startup process *and* to create the promote signal file.Yes, but isn't this the same as the way to trigger fast promotion in HEAD?
Yep, but my point is that some users who have been relying only on
SIGUSR2 sent to the startup process for a promotion may be surprised
to see that doing the same operation does not trigger a promotion
anymore.
--
Michael
On 2020/04/21 14:54, Michael Paquier wrote:
On Tue, Apr 21, 2020 at 02:27:20PM +0900, Fujii Masao wrote:
On 2020/04/21 10:59, Michael Paquier wrote:
With your patch, this code
now means that in order to finish recovery you need to send SIGUSR2 to
the startup process *and* to create the promote signal file.Yes, but isn't this the same as the way to trigger fast promotion in HEAD?
Yep, but my point is that some users who have been relying only on
SIGUSR2 sent to the startup process for a promotion may be surprised
to see that doing the same operation does not trigger a promotion
anymore.
Yeah, but that's not documented. So I don't think that we need to keep
the backward-compatibility for that.
Also in that case, non-fast promotion is triggered. Since my patch
tries to remove non-fast promotion, it's intentional to prevent them
from doing that. But you think that we should not drop that because
there are still some users for that?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Tue, Apr 21, 2020 at 03:29:54PM +0900, Fujii Masao wrote:
Yeah, but that's not documented. So I don't think that we need to keep
the backward-compatibility for that.Also in that case, non-fast promotion is triggered. Since my patch
tries to remove non-fast promotion, it's intentional to prevent them
from doing that. But you think that we should not drop that because
there are still some users for that?
It would be good to ask around to folks maintaining HA solutions about
that change at least, as there could be a point in still letting
promotion to happen in this case, but switch silently to the fast
path.
--
Michael
On 2020/04/21 15:36, Michael Paquier wrote:
On Tue, Apr 21, 2020 at 03:29:54PM +0900, Fujii Masao wrote:
Yeah, but that's not documented. So I don't think that we need to keep
the backward-compatibility for that.Also in that case, non-fast promotion is triggered. Since my patch
tries to remove non-fast promotion, it's intentional to prevent them
from doing that. But you think that we should not drop that because
there are still some users for that?It would be good to ask around to folks maintaining HA solutions about
that change at least, as there could be a point in still letting
promotion to happen in this case, but switch silently to the fast
path.
*If* there are some HA solutions doing that, IMO that they should be changed
so that the documented official way to trigger promotion (i.e., pg_ctl promote,
pg_promote or trigger_file) is used instead.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
At Tue, 21 Apr 2020 15:48:02 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2020/04/21 15:36, Michael Paquier wrote:
On Tue, Apr 21, 2020 at 03:29:54PM +0900, Fujii Masao wrote:
Yeah, but that's not documented. So I don't think that we need to keep
the backward-compatibility for that.Also in that case, non-fast promotion is triggered. Since my patch
tries to remove non-fast promotion, it's intentional to prevent them
from doing that. But you think that we should not drop that because
there are still some users for that?It would be good to ask around to folks maintaining HA solutions about
that change at least, as there could be a point in still letting
promotion to happen in this case, but switch silently to the fast
path.*If* there are some HA solutions doing that, IMO that they should be
*changed
so that the documented official way to trigger promotion (i.e., pg_ctl
promote,
pg_promote or trigger_file) is used instead.
The difference between fast and non-fast promotions is far trivial
than the difference between promotion happens or not. I think
everyone cares about the new version actually promotes by the steps
they have been doing, but few of them even notices the difference
between the fast and non-fast. If those who are using non-fast
promotion for a certain reason should notice the change of promotion
behavior in release notes.
This is similar to the change of the default waiting behvaior of
pg_ctl at PG10.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 20 Apr 2020 15:26:16 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
Patch attached. I will add this into the first CF for v14.
- if (!fast_promoted)
+ if (!promoted)
RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
CHECKPOINT_IMMEDIATE |
CHECKPOINT_WAIT);
If we don't find the checkpoint record just before, we don't insert
End-Of-Recovery record then run an immediate chekpoint. I think if we
nuke the non-fast promotion, shouldn't we insert the EOR record even
in that case?
Or, as Andres suggested upthread, do we always insert it?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020/04/21 17:15, Kyotaro Horiguchi wrote:
At Mon, 20 Apr 2020 15:26:16 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
Patch attached. I will add this into the first CF for v14.
- if (!fast_promoted)
+ if (!promoted)
RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
CHECKPOINT_IMMEDIATE |
CHECKPOINT_WAIT);If we don't find the checkpoint record just before, we don't insert
End-Of-Recovery record then run an immediate chekpoint. I think if we
nuke the non-fast promotion, shouldn't we insert the EOR record even
in that case?
I'm not sure if that's safe. What if the server crashes before the checkpoint
completes in that case? Since the last checkpoint record is not available,
the subsequent crash recovery will fail. This would lead to that the server
will never start up. Right? Currently ISTM that end-of-recovery-checkpoint
is executed to avoid such trouble in that case.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Hello,
On Tue, 21 Apr 2020 15:36:22 +0900
Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Apr 21, 2020 at 03:29:54PM +0900, Fujii Masao wrote:
Yeah, but that's not documented. So I don't think that we need to keep
the backward-compatibility for that.Also in that case, non-fast promotion is triggered. Since my patch
tries to remove non-fast promotion, it's intentional to prevent them
from doing that. But you think that we should not drop that because
there are still some users for that?It would be good to ask around to folks maintaining HA solutions about
that change at least, as there could be a point in still letting
promotion to happen in this case, but switch silently to the fast
path.
FWIW, PAF relies on pg_ctl promote. No need for non-fast promotion.
Regards,
On 2020-Apr-21, Jehan-Guillaume de Rorthais wrote:
On Tue, 21 Apr 2020 15:36:22 +0900
Michael Paquier <michael@paquier.xyz> wrote:
Also in that case, non-fast promotion is triggered. Since my patch
tries to remove non-fast promotion, it's intentional to prevent them
from doing that. But you think that we should not drop that because
there are still some users for that?It would be good to ask around to folks maintaining HA solutions about
that change at least, as there could be a point in still letting
promotion to happen in this case, but switch silently to the fast
path.FWIW, PAF relies on pg_ctl promote. No need for non-fast promotion.
AFAICT repmgr uses 'pg_ctl promote', and has since version 3.0 (released
in mid 2015). It was only 3.3.2 (mid 2017) that supported Postgres 10,
so it seems fairly safe to assume that the removal won't be a problem.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Apr-20, Fujii Masao wrote:
+ /* + * In 9.1 and 9.2 the postmaster unlinked the promote file inside the + * signal handler. It now leaves the file in place and lets the + * Startup process do the unlink. + */ + if (IsPromoteSignaled() && stat(PROMOTE_SIGNAL_FILE, &stat_buf) == 0) { - /* - * In 9.1 and 9.2 the postmaster unlinked the promote file inside the - * signal handler. It now leaves the file in place and lets the - * Startup process do the unlink. This allows Startup to know whether - * it should create a full checkpoint before starting up (fallback - * mode). Fast promotion takes precedence. - */
It seems pointless to leave a very old comment that documents what the
code no longer does. I thikn it would be better to reword it indicating
what the code does do, ie. something like "Leave the signal file in
place; it will be removed by the startup process when ..."
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Tue, 21 Apr 2020 22:08:56 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2020/04/21 17:15, Kyotaro Horiguchi wrote:
At Mon, 20 Apr 2020 15:26:16 +0900, Fujii Masao
<masao.fujii@oss.nttdata.com> wrote inPatch attached. I will add this into the first CF for v14.
- if (!fast_promoted)
+ if (!promoted)
RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
CHECKPOINT_IMMEDIATE |
CHECKPOINT_WAIT);
If we don't find the checkpoint record just before, we don't insert
End-Of-Recovery record then run an immediate chekpoint. I think if we
nuke the non-fast promotion, shouldn't we insert the EOR record even
in that case?I'm not sure if that's safe. What if the server crashes before the
checkpoint
completes in that case? Since the last checkpoint record is not
available,
the subsequent crash recovery will fail. This would lead to that the
server
will never start up. Right? Currently ISTM that
Yes, that's right.
end-of-recovery-checkpoint
is executed to avoid such trouble in that case.
I meant that we always have EOR at the end of recovery. So in the
missing latest checkpoint (and crash recovery) case, we insert EOR
after the immediate checkpoint. That also means we no longer set
CHECKPOINT_END_OF_RECOVERY to the checkpoint, too.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center