Fast promotion not used when doing a recovery_target PITR restore?

Started by Andres Freund, almost 9 years ago, 8 messages, hackers

andres@anarazel.de

almost 9 years ago

Hi,

When doing a PITR style recovery, with recovery target set, we're
currently not doing a fast promotion, in contrast to the handling when
doing a pg_ctl or trigger file based promotion. That can prolong making
the server available for writes.

I can't really see a reason for this?

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Michael Paquier

michael@paquier.xyz

almost 9 years ago

In reply to: Andres Freund (#1)

Re: Fast promotion not used when doing a recovery_target PITR restore?

On Thu, Jun 22, 2017 at 3:04 AM, Andres Freund <andres@anarazel.de> wrote:

When doing a PITR style recovery, with recovery target set, we're
currently not doing a fast promotion, in contrast to the handling when
doing a pg_ctl or trigger file based promotion. That can prolong making
the server available for writes.

I can't really see a reason for this?

Yes, you are right. I see no reason either why this cannot be done.
Why not just switch fast_promote to true when using
RECOVERY_TARGET_ACTION_PROMOTE? That's a bug, though not a critical
one.
--
Michael
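
The proposal above can be illustrated with a small, self-contained model. This is only a sketch, not the actual xlog.c code: the enum RecoveryEndReason and both function names are hypothetical, and only fast_promote and RECOVERY_TARGET_ACTION_PROMOTE come from the thread itself. It contrasts the behavior Andres describes with the suggested change.

```c
#include <stdbool.h>

/* Hypothetical model of how recovery can end; not PostgreSQL's real types. */
typedef enum
{
    END_BY_PG_CTL_PROMOTE,   /* promotion requested with pg_ctl */
    END_BY_TRIGGER_FILE,     /* promotion requested with a trigger file */
    END_BY_RECOVERY_TARGET   /* recovery target reached, action is "promote" */
} RecoveryEndReason;

/* Behavior as described in the thread: the recovery target path is left out. */
static bool
fast_promote_current(RecoveryEndReason reason)
{
    return reason == END_BY_PG_CTL_PROMOTE || reason == END_BY_TRIGGER_FILE;
}

/* Suggested behavior: reaching the target is treated like any other promotion. */
static bool
fast_promote_proposed(RecoveryEndReason reason)
{
    return fast_promote_current(reason) || reason == END_BY_RECOVERY_TARGET;
}
```

In this model the two functions differ only for END_BY_RECOVERY_TARGET, which is exactly the case the thread is about.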

andres@anarazel.de

almost 9 years ago

In reply to: Michael Paquier (#2)

Re: Fast promotion not used when doing a recovery_target PITR restore?

On 2017-06-22 14:04:42 +0900, Michael Paquier wrote:

On Thu, Jun 22, 2017 at 3:04 AM, Andres Freund <andres@anarazel.de> wrote:

When doing a PITR style recovery, with recovery target set, we're
currently not doing a fast promotion, in contrast to the handling when
doing a pg_ctl or trigger file based promotion. That can prolong making
the server available for writes.

I can't really see a reason for this?

Yes, you are right. I see no reason either why this cannot be done.
Why not just switch fast_promote to true when using
RECOVERY_TARGET_ACTION_PROMOTE? That's a bug, though not a critical
one.

I don't think it's really a bug - just a missed optimization. I'd
personally not be in favor of backpatching this - it'll have some chance
of screwing things up, even if I hope that chance is fairly small.

As a wider discussion, I wonder if we should keep non-fast promotion for
anything but actual crash recovery? And even there it might actually be
a pretty good idea to not force a full checkpoint - getting up fast
after a crash is kinda important..

Andres Freund


Michael Paquier

michael@paquier.xyz

almost 9 years ago

In reply to: Andres Freund (#3)

Re: Fast promotion not used when doing a recovery_target PITR restore?

On Fri, Jun 23, 2017 at 2:34 AM, Andres Freund <andres@anarazel.de> wrote:

I don't think it's really a bug - just a missed optimization. I'd
personally not be in favor of backpatching this - it'll have some chance
of screwing things up, even if I hope that chance is fairly small.

It would be better to wait until the branch for PG11 opens then.

As a wider discussion, I wonder if we should keep non-fast promotion for
anything but actual crash recovery?

Yes, I would go a bit further and remove fallback_promote.

And even there it might actually be
a pretty good idea to not force a full checkpoint - getting up fast
after a crash is kinda important..

But not that. Crash recovery is designed to be simple and robust, with
only the postmaster and the startup process running while it happens.
If the startup process did not perform the checkpoint by itself, we
would need the bgwriter, which increases the likelihood of bugs. In
short, I don't think performance is what matters for crash recovery;
robustness and simplicity are.
--
Michael
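
The trade-off being debated here can be sketched as a single decision function. This is a simplified model under stated assumptions, not PostgreSQL's implementation: the enum, the function name, and the aux_processes_running parameter (standing in for "checkpointer, bgwriter, etc. are active") are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical outcomes for the checkpoint at the end of recovery. */
typedef enum
{
    CKPT_DEFERRED,           /* fast promotion: accept writes first, checkpoint later */
    CKPT_REQUEST_AND_WAIT,   /* ask a helper process for a checkpoint and wait for it */
    CKPT_BY_STARTUP_PROCESS  /* the startup process performs the checkpoint itself */
} EndOfRecoveryCheckpoint;

/*
 * aux_processes_running: assumed flag meaning the helper processes are
 * available, as in archive recovery or replication.
 * fast_promote: whether a fast promotion was requested.
 */
static EndOfRecoveryCheckpoint
choose_end_of_recovery_checkpoint(bool aux_processes_running, bool fast_promote)
{
    /* Crash recovery: only the postmaster and the startup process exist. */
    if (!aux_processes_running)
        return CKPT_BY_STARTUP_PROCESS;

    /* With helpers available, fast promotion avoids waiting for a full checkpoint. */
    return fast_promote ? CKPT_DEFERRED : CKPT_REQUEST_AND_WAIT;
}
```

In these terms, the wider question raised in the thread is whether the first branch should also be allowed to defer the checkpoint.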


andres@anarazel.de

almost 9 years ago

In reply to: Michael Paquier (#4)

Re: Fast promotion not used when doing a recovery_target PITR restore?

On 2017-06-23 10:56:07 +0900, Michael Paquier wrote:

And even there it might actually be
a pretty good idea to not force a full checkpoint - getting up fast
after a crash is kinda important..

But not that. Crash recovery is designed to be simple and robust, with
only the postmaster and the startup process running while it happens.
If the startup process did not perform the checkpoint by itself, we
would need the bgwriter, which increases the likelihood of bugs. In
short, I don't think performance is what matters for crash recovery;
robustness and simplicity are.

I'm far from convinced by this. By now WAL replay with checkpointer,
bgwriter, etc. active is actually *more* tested than the cases without
it. The likelihood of bugs is higher in the less frequently exercised
paths, and given that replication exercises the situation with all those
processes active on a continuous basis, I'm fairly unconvinced by your
argument.

- Andres


Michael Paquier

michael@paquier.xyz

almost 9 years ago

In reply to: Andres Freund (#5)

Re: Fast promotion not used when doing a recovery_target PITR restore?

On Wed, Jun 28, 2017 at 3:44 AM, Andres Freund <andres@anarazel.de> wrote:

I'm far from convinced by this. By now WAL replay with checkpointer,
bgwriter, etc. active is actually *more* tested than the cases without
it. The likelihood of bugs is higher in the less frequently exercised
paths, and given that replication exercises the situation with all those
processes active on a continuous basis, I'm fairly unconvinced by your
argument.

Crash recovery is the one place where failures should never happen.
Don't you think that it should remain as simple as it was originally
designed? It seems to me that keeping things simple has higher
priority than being able to reconnect sooner by delaying the
checkpoint.
--
Michael


andres@anarazel.de

almost 9 years ago

In reply to: Michael Paquier (#6)

Re: Fast promotion not used when doing a recovery_target PITR restore?

On 2017-06-28 06:04:23 +0900, Michael Paquier wrote:

On Wed, Jun 28, 2017 at 3:44 AM, Andres Freund <andres@anarazel.de> wrote:

I'm far from convinced by this. By now WAL replay with checkpointer,
bgwriter, etc. active is actually *more* tested than the cases without
it. The likelihood of bugs is higher in the less frequently exercised
paths, and given that replication exercises the situation with all those
processes active on a continuous basis, I'm fairly unconvinced by your
argument.

Crash recovery is the one place where failures should never happen.
Don't you think that it should remain as simple as it was originally
designed? It seems to me that keeping things simple has higher
priority than being able to reconnect sooner by delaying the
checkpoint.

You seem to completely argue beside my point that the replication path
is *more* robust by now? And there are plenty of scenarios where a
faster startup is quite crucial for performance. The difference between
an immediate shutdown + recovery without checkpoint and a fast shutdown
can be very large, and that matters a lot for faster postgres updates etc.

Andres


Michael Paquier

michael@paquier.xyz

almost 9 years ago

In reply to: Andres Freund (#7)

Re: Fast promotion not used when doing a recovery_target PITR restore?

On Wed, Jun 28, 2017 at 6:13 AM, Andres Freund <andres@anarazel.de> wrote:

You seem to completely argue beside my point that the replication path
is *more* robust by now? And there are plenty of scenarios where a
faster startup is quite crucial for performance. The difference between
an immediate shutdown + recovery without checkpoint and a fast shutdown
can be very large, and that matters a lot for faster postgres updates etc.

If you go that way, it would seem safer to me to give users some
control with a switch, defaulting to the previous behavior. A complete
switch to the newer behavior could then be done later on, depending on
what has been found.
--
Michael
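
One way to picture such a switch is a setting that is off by default, so existing installations keep the old behavior. This is purely illustrative: no such setting is named in the thread, and the struct, field, and function names below are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical settings struct; the field name is invented for illustration. */
typedef struct
{
    bool defer_checkpoint_after_crash_recovery;
} RecoverySettings;

/* Default to the previous behavior: checkpoint before accepting connections. */
static RecoverySettings
default_recovery_settings(void)
{
    RecoverySettings settings = { false };

    return settings;
}

/* True when startup must wait for the end-of-recovery checkpoint. */
static bool
crash_recovery_waits_for_checkpoint(const RecoverySettings *settings)
{
    return !settings->defer_checkpoint_after_crash_recovery;
}
```

With the default settings the function returns true, which matches the previous behavior. Only a user who turns the switch on would get the faster startup.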
